This is an archive of the discontinued LLVM Phabricator instance.

[DataLayout] Introduce DataLayout::getPointerIntegralSize(AS)
Needs ReviewPublic

Authored by arichardson on Oct 4 2022, 7:32 AM.

Details

Summary

This function can be used to retrieve the number of bits that can be used
for arithmetic in a given address space (i.e. the range of the address
space). For most (all?) in-tree targets this should not make any difference,
but differentiating between the size of a pointer in bits and the address
range is extremely important e.g. for CHERI-enabled targets, where pointers
carry additional metadata such as bounds and permissions and only a subset
of the pointer bits is used as the address. This could also benefit other
users of non-integral pointers but I am not familiar with any of those
backends. In the out-of-tree CHERI target, we use the index width of the
datalayout to identify the address range (and I believe this was also
the intent when the index type was introduced in D42123).
This commit is just a clarification of the LangRef as well as
the introduction of new functions that more clearly explain the
intended usage. In the future, if we end up supporting a target
with index width != integral range, we could add a new datalayout
component.

I am not sure if getPointerIntegralSize() is the best name for this
accessor, I also considered names such as getPointerArithmenticRange or
getPointerAddressRange().

Diff Detail

Event Timeline

arichardson created this revision.Oct 4 2022, 7:32 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 4 2022, 7:32 AM
arichardson requested review of this revision.Oct 4 2022, 7:32 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 4 2022, 7:32 AM
nikic added a comment.Oct 4 2022, 7:43 AM

I don't really understand what the distinction between the index size and the integral size is supposed to be. Can you please give some examples where these two quantities would differ?

I would be extremely vary of adding a third dimension to pointer sizes. The index size is already enough of a burden to deal with, and we've only recently approached something resembling correct use of index sizes.

I don't really understand what the distinction between the index size and the integral size is supposed to be. Can you please give some examples where these two quantities would differ?

I would be extremely vary of adding a third dimension to pointer sizes. The index size is already enough of a burden to deal with, and we've only recently approached something resembling correct use of index sizes.

Ah, in that case I would be more than happy to drop this distinction and just clarify the datalayout that the index size is the size of the address (which is what we do for CHERI already).
I added this new type based on your comment in D99660:

The index type is used for GEP index and casting a pointer to the index type doesn't make a whole lot of sense to me. Do you have any LangRef wording or other usages that would show that this is a sensible thing to do?

Would a clarification to the LangRef be sufficient?

Basically, the problem I'm trying to solve here is that code introducing ptr2int right now uses DL.getPointerSize() to obtain the integer, but for us we want this to be the address size (i.e. 64 for ELFCLASS64).
We don't want ptr2int to give us i128 types since only 64 of those bits are meaningful (but using ptrtoint is perfectly fine, so the current non-integral constraints are too restrictive)

Avoid adding a new datalayout component and only add the function

arichardson edited the summary of this revision. (Show Details)Oct 4 2022, 8:13 AM
arsenm added a subscriber: arsenm.Oct 4 2022, 8:57 AM

But I thought we banned ptrtoint for nonintegral pointers

jrtc27 added a comment.Oct 4 2022, 9:03 AM

CHERI capabilities aren't non-integral. Converting a capability to an integer gives you the address, discarding the metadata (which is also stable*), which is as well-defined as it is on normal architectures. Non-integral pointers are weird unstable things where ptrtoint on the same thing can give different results and so you can't introduce new instructions during optimisation (but so long as you know what you're doing it's fine to create them in the first place in the frontend).

  • Technically not true when revocation is involved if you inspect the metadata of a capability referring to an object past the end of its lifetime, but that is highly UB and constrained in terms of what you can observe
nikic added a comment.Oct 4 2022, 12:54 PM

Is this needed for anything but the one usage in D99660? We could just stop using ptrtoint there and just explicitly check for the nullptr and inttoptr cases, which is all this handles in practice. This overly general constant expression based code has already caused enough complications in the past, so I'm happy to drop it.

I don't think producing a ptr2int out of thin air (without a known result type) is common, and should probably be avoided in general. The main case where we currently introduce ptr2int is when type punning through memory, and in that case the load/store type determines the used type.

Is this needed for anything but the one usage in D99660? We could just stop using ptrtoint there and just explicitly check for the nullptr and inttoptr cases, which is all this handles in practice. This overly general constant expression based code has already caused enough complications in the past, so I'm happy to drop it.

I don't think producing a ptr2int out of thin air (without a known result type) is common, and should probably be avoided in general. The main case where we currently introduce ptr2int is when type punning through memory, and in that case the load/store type determines the used type.

There are also a few other cases where code wants needs the address range instead of the pointer size but that does not matter for most targets. I'll upload a few more patches that show where this function is needed for CHERI.

CHERI capabilities aren't non-integral. Converting a capability to an integer gives you the address, discarding the metadata (which is also stable*), which is as well-defined as it is on normal architectures.

Note that this is purely for historical reasons: Non-integral pointers were not available when we started. I worked with the folks pushing NI pointers to make sure that they'd work with CHERI and the plan was always to move over to NI at some point.

I overloaded the semantics of ptrtoint as get-address as a quick hack to make things work but it's definitely not the right thing to do and we spent several years working with other folks with similar requirements to upstream features that would let us move away from this. The behaviour of CHERI's ptrtoint is confusing to optimisers. I believe that the correct solution (which, I vaguely recall, I wrote in a roadmap doc before I left the CL) is:

  1. Make CHERI use NI pointers.
  2. Make clang generate get-address intrinsics instead of ptrtoint.
  3. Help fix any optimisations that try to introduce ptrtoint for NI pointers.
  4. Make the CHERI back ends reject PTRTOINT DAG nodes.

CHERI LLVM IR should only ever have get-address and set-address intrinsics, no inttoptr and ptrtoint instructions. This should improve optimisations for things that round-trip addresses through pointers because ptrtoint is treated as an escape for alias analysis, whereas get-address is not and set-address is explicitly a provenance-carrying operation that works directly with alias analysis. This structure is also desirable for Rust's Strict Provenance model, even with non-CHERI targets.

This is also the approach that I've recommended to the embecosm folks working on preparing CHERI LLVM for upstreaming. It would be great if more people could join those calls.

jrtc27 added a comment.Oct 5 2022, 7:05 AM

Except then you can't do unsigned long x = (unsigned long)&y as intrinsics are not constant expressions. Non-integral pointers are too strict for CHERI as things stand, we want a subset of their behaviour.

Except then you can't do unsigned long x = (unsigned long)&y as intrinsics are not constant expressions. Non-integral pointers are too strict for CHERI as things stand, we want a subset of their behaviour.

You can; however, do it as a constant address-space cast, followed by a ptrtoint, which would make lowering easier in the back end because a constant address space cast of a global should give a non-capability relocation.

mkazantsev resigned from this revision.Dec 7 2022, 12:02 AM

Not sure I can give any useful feedback here.