Page MenuHomePhabricator

[LangRef] Clarify GEP inbounds wrapping semantics
ClosedPublic

Authored by nikic on Tue, Nov 3, 12:21 PM.

Details

Summary

The current LangRef wording for GEP inbounds is not very clear in what it implies in terms of wrapping restrictions for the underlying additions and multiplications. This updates LangRef with my understanding of how this is currently being interpreted in practice.

For example, there's this SCEV comment: https://github.com/llvm/llvm-project/blob/7f34aca083b528db1d880b406f1a1953eeb6aa95/llvm/lib/Analysis/ScalarEvolution.cpp#L5061-L5066

Another indication is the fact that clang emits pointer comparisons using unsigned predicates, so we must be assuming that allocated objects cannot wrap the unsigned address space. (For address space 0 this is automatically given because there may be no allocated object containing the null pointer.)

However, I do wonder if we could get away with saying that no allocated object may wrap the signed address space, in which case inbounds would just map cleanly to nsw, which would both allow more optimization and be easier to reason about. We have at least one important target (x86-64) where this is a given by hardware constraints. But I don't know if we can get away with making that a general limitation. (I don't think it's a good idea to make core GEP semantics target-dependent.)

Diff Detail

Event Timeline

nikic created this revision.Tue, Nov 3, 12:21 PM
nikic requested review of this revision.Tue, Nov 3, 12:21 PM
uenoku added a subscriber: uenoku.Tue, Nov 3, 12:54 PM

LGTM, an example of a gep with inbounds wouldbe nice. there are some test cases like: CodeGen/2009-02-13-zerosize-union-field.c, CodeGen/2010-07-14-ref-off-end.c etc from where we can take a reduced example for the doc.

aqjune added a comment.Tue, Nov 3, 6:18 PM

However, I do wonder if we could get away with saying that no allocated object may wrap the signed address space, in which case inbounds would just map cleanly to nsw, which would both allow more optimization and be easier to reason about. We have at least one important target (x86-64) where this is a given by hardware constraints. But I don't know if we can get away with making that a general limitation. (I don't think it's a good idea to make core GEP semantics target-dependent.)

It seems sbrk() assume an address space is unsigned:
https://code.woboq.org/userspace/glibc/misc/sbrk.c.html

Since it is frequently used to implement malloc(), I guess malloc can in general return an allocation whose address is something like [0x7fffffff, 0x80000001]. Does this make sense?

nlopes added a comment.Wed, Nov 4, 1:03 AM

LGTM modulo the two comments. Thanks for writing this down!

llvm/docs/LangRef.rst
9798

I think the "one byte past" can be misleading. A pointer can point to the end of an object, or as the C++ standard puts it, it may point to a hypothetical next element n if the object has n elements.
http://eel.is/c++draft/basic.compound#3.4

9808

one more case is that offsets must fit (signed-wise) in the address space's size. So a GEP with 64 bits offsets on a 32-bit address space is OK as long as those offsets fit in 32 bits.

nikic updated this revision to Diff 302941.Wed, Nov 4, 12:34 PM
  • Try to use clearer wording for the address to the end of the object.
  • Mention that the truncation to the DL index type must preserve the signed value.
  • Merge the two bullet lists. I think the nsw requirement for the offset addition needs to be listed as an explicit requirement, it's not a direct consequence of the other rules.
  • Mention that these rules assume that no allocated object wraps the unsigned address space, and no object is larger than half the address space.
nikic updated this revision to Diff 302965.Wed, Nov 4, 1:51 PM

Clarify that the nsw offset wrapping restrictions are relative to the pointer index type (which may have smaller size than the pointer).

This makes sense to me.

However, I do wonder if we could get away with saying that no allocated object may wrap the signed address space, in which case inbounds would just map cleanly to nsw, which would both allow more optimization and be easier to reason about.

Unfortunately, that wouldn't work for us. We have 32-bit address spaces that are treated as the unsigned range [0, 0xffffffff], and there's no limitation as to where objects can appear in that range.

llvm/docs/LangRef.rst
9811–9812

As a second corollary, the addition wraps in an unsigned sense if and only if the added offset is negative?

nikic added a comment.Thu, Nov 5, 9:49 AM

This makes sense to me.

However, I do wonder if we could get away with saying that no allocated object may wrap the signed address space, in which case inbounds would just map cleanly to nsw, which would both allow more optimization and be easier to reason about.

Unfortunately, that wouldn't work for us. We have 32-bit address spaces that are treated as the unsigned range [0, 0xffffffff], and there's no limitation as to where objects can appear in that range.

Yeah, doesn't look like we can require this in the general case. I also checked that malloc does indeed return allocations that cross the sign boundary for -m32 binaries, using the following code:

#include <stdio.h>
#include <stdlib.h>
#define BLOCK_SIZE (128 * 1024 * 1024)
int main() {
    while (1) {
        char *ptr = malloc(BLOCK_SIZE);
        if (!ptr) {
            printf("OOM!\n");
            return 0;
        }
        printf("%p - %p\n", ptr, ptr + BLOCK_SIZE);
    }
}

So if we wanted to make use of the x86-64 address space layout, we'd have to do that based on a datalayout property or so, and that's out of scope of this change (and likely generally not worthwhile).

llvm/docs/LangRef.rst
9811–9812

I agree that this statement is true, but is it also useful for something? Not sure in what way optimizations would make use of the fact that the calculation always wraps.

jrtc27 added a subscriber: jrtc27.Thu, Nov 5, 1:19 PM

It seems sbrk() assume an address space is unsigned:
https://code.woboq.org/userspace/glibc/misc/sbrk.c.html

Since it is frequently used to implement malloc(), I guess malloc can in general return an allocation whose address is something like [0x7fffffff, 0x80000001]. Does this make sense?

It's also an ancient deprecated interface, rather dodgy in terms of pointer provenance (sbrk returns the _old_ program break), and does not exist on some modern OS ports (notably, FreeBSD/arm64 and FreeBSD/riscv do not provide it). Whilst some best-of-1990s allocators that make use of it are still rather more pervasive than they should be, it really needs to be consigned to the history books...

aqjune added a comment.Thu, Nov 5, 7:40 PM

It seems sbrk() assume an address space is unsigned:
https://code.woboq.org/userspace/glibc/misc/sbrk.c.html

Since it is frequently used to implement malloc(), I guess malloc can in general return an allocation whose address is something like [0x7fffffff, 0x80000001]. Does this make sense?

It's also an ancient deprecated interface, rather dodgy in terms of pointer provenance (sbrk returns the _old_ program break), and does not exist on some modern OS ports (notably, FreeBSD/arm64 and FreeBSD/riscv do not provide it). Whilst some best-of-1990s allocators that make use of it are still rather more pervasive than they should be, it really needs to be consigned to the history books...

Wow, thanks for the info! :) It seems the man page (https://www.freebsd.org/cgi/man.cgi?query=sbrk&sektion=2 ) is saying the same thing as well.

nikic marked 2 inline comments as done.Mon, Nov 9, 9:10 AM

Any further feedback here?

jrtc27 added inline comments.Mon, Nov 9, 9:12 AM
llvm/docs/LangRef.rst
9797

Minor tweak to not assume that pointers are integers

nikic updated this revision to Diff 303960.Mon, Nov 9, 12:33 PM

Apply suggestion by jrtc27.

aqjune accepted this revision.Tue, Nov 10, 9:50 PM

I like the change, so LGTM

This revision is now accepted and ready to land.Tue, Nov 10, 9:50 PM

(..but one more LGTM might be better to make things sure)

nlopes added inline comments.Wed, Nov 11, 11:41 AM
llvm/docs/LangRef.rst
9799

I still don't like the current writing. I would need to see some evidence from language standards that they require pointers past the end of objects.

9802

than than

9806

It's a bit stronger than that. The addition of each offset to the preceding pointer should not overflow. You can't do e.g.:
gep inbounds %p, -1, 1

because %p-1 is OOB, even though the result is in bounds (because %p must be in bounds).

jrtc27 added inline comments.Wed, Nov 11, 11:51 AM
llvm/docs/LangRef.rst
9799

If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined.

https://port70.net/~nsz/c/c11/n1570.html#6.5.6p8

9806

It's more nuanced than that, no? %p could be a pointer part-way through (or one past the end of) an object, in which case %p-1 would still be in bounds?

nikic added inline comments.Wed, Nov 11, 12:00 PM
llvm/docs/LangRef.rst
9799

What would be a better wording? "One past the end" is a term of art, and as such should be well understood: https://www.google.com/search?q=one+past+the+end

9806

This is specified in the next bullet point (successive addition to the base pointer must remain in bounds of the allocated object).

nlopes added inline comments.Wed, Nov 11, 12:20 PM
llvm/docs/LangRef.rst
9799

Thanks for the reference. Though that paragraph doesn't say that a pointer 1 byte past the end is valid.
It says that the following is valid:
int x[n]
q = p+(n-1); points to the last element
q = p+1;
points to one element past the last

Doesn't say that (char*)(p+n)+1 is valid, which is what it means for a pointer 1 byte past the end to be valid.

So AFAICT, both the C & C++ standards agree that p+n is the max one needs to support.

My suggestion is simply to remove the part in parenthesis "(which is one byte past the last byte contained in the object)". Or replace it with similar wording of the C++ standard (corresponds to a hypothetical next element or something like that).

9806

Ok, right. Then this is more of a corollary of the point below. Sounds correct at least. I'm happy to keep it.

jrtc27 added inline comments.Wed, Nov 11, 1:03 PM
llvm/docs/LangRef.rst
9799

Assuming

q = p+1; /* points to one element past the last */

was meant to be

q = p+n; /* points to one element past the last */

It depends whether you define end as being p + n or (char *)(p + n) - 1. C/C++ use the latter (as do people when they talk about "one past the end" pointers), whereas you seem to be using the former. To C/C++, (char*)(p+n)+1 would be OOB as it's one byte after one past the last element.

So I think we are on the same page in terms of semantics, we just have different ideas of what certain terms mean.

nikic updated this revision to Diff 304876.Thu, Nov 12, 9:45 AM
  • Remove parenthetical about end.
  • Fix "than than" typo.
nikic marked an inline comment as done.Thu, Nov 12, 9:46 AM
nikic added inline comments.
llvm/docs/LangRef.rst
9799

Okay, I've dropped the part in the parentheses.

LGTM!
Thanks a lot for working on this!

This revision was automatically updated to reflect the committed changes.