diff --git a/llvm/docs/Reference.rst b/llvm/docs/Reference.rst --- a/llvm/docs/Reference.rst +++ b/llvm/docs/Reference.rst @@ -44,6 +44,7 @@ SpeculativeLoadHardening Statepoints SystemLibrary + TBAA TestingGuide TransformMetadata TypeMetadata @@ -216,3 +217,6 @@ :doc:`YamlIO` A reference guide for using LLVM's YAML I/O library. + +:doc:`TBAA` + A more complete description of LLVM's Type-Based Alias Analysis. diff --git a/llvm/docs/TBAA.rst b/llvm/docs/TBAA.rst new file mode 100644 --- /dev/null +++ b/llvm/docs/TBAA.rst @@ -0,0 +1,325 @@ +========================= +Type-Based Alias Analysis +========================= + +.. contents:: + :local: + +Introduction +============ + +Type-Based Alias Analysis, often abbreviated as TBAA, is an +:doc:`alias analysis ` in LLVM that implements the strict +aliasing rules of C/C++. The LLVM type system does not directly correspond to C +or C++ types (let alone other frontends), so all of the necessary information +about source-language types is encoded as metadata, predominantly the ``!tbaa`` +metadata. This document explains how TBAA works in LLVM, how it is encoded, and +how to reason about it correctly. + +Basics +====== + +TBAA consists of a few different types of metadata: scalar nodes, struct nodes, +and access tags. + +All TBAA metadata shares a particular root node, which has an optional string +indicating the TBAA hierarchy it is representing. Any TBAA queries between two +memory locations that resolve to different root nodes return a ``MayAlias`` +result: this allows multiple source languages that have different semantics for +alias analysis to coexist in the same LLVM file (for example, during link-time +optimization) without causing issues. + +Scalar nodes +============ + +Scalar nodes in TBAA represent a possible set of memory locations that the +pointer could point to. These scalar nodes form a tree based on a subset +relationship. If two pointers point to different scalar TBAA nodes, then the +result may be ``MayAlias`` if one of the nodes is an ancestor of the other one, +and ``NoAlias`` otherwise. + +The following is an example of a TBAA scalar node tree: + +.. image:: ./tbaa-scalar-tree.svg + +In the example tree, there are 6 different kinds of scalar nodes in the TBAA +tree. In this case, a pointer pointing to ``C`` may alias with a pointer +pointing to ``A``, ``C``, ``E``, or ``F``, but may not alias with a pointer +pointing to ``B`` or ``D``. + +Representation +-------------- + +Scalar nodes in LLVM are represented as ``MDNode`` values with two or three +operands. The first operand is an ``MDString`` that indicates the type the +scalar node represents. This string has almost no semantic meaning, with a few +exceptions: + +- If two modules are linked together, metadata using the same strings may be + uniqued onto each other. This allows modules to use the same TBAA tree in + situations such as LTO. + +- The exact string ``"vtable pointer"`` indicates that the pointer is an access + to a vtable pointer and is eligible for the devirtualization optimization. + +The second operand of a scalar node is the metadata node of its parent node. If +there is no parent, then it is the metadata of the TBAA root node. (Note that +TBAA root nodes are not considered scalar nodes). + +The third operand of a scalar node is optional. If present, it must be a +constant integer ``0``. + +Thus, the example scalar node tree in the previous section would be written in +LLVM as: + +.. code-block:: llvm + + !0 = !{!"TBAA root"} ; The root node, not a scalar node. + !1 = !{!"A", !0, 0} ; The third operand of a scalar node is optional. + !2 = !{!"B", !1, 0} + !3 = !{!"C", !1, 0} + !4 = !{!"D", !1, 0} + !5 = !{!"E", !3, 0} + !6 = !{!"F", !5, 0} + +Struct nodes +============ + +Struct TBAA nodes indicate the layout of fields within structs and unions. These +fields are tuples of type information and byte offsets, where the type +information points to another scalar or struct TBAA node sharing the same TBAA +root. Additionally, struct TBAA nodes have a string name which, like other +string names in TBAA, carries no inherent semantic value other than its role in +uniqueing metadata in linking multiple LLVM modules together. + +Conceptually, fields in struct nodes also have a size. This size is not +explicitly stored in the metadata, but is implicitly computed from the offsets: +each field is assumed to extend from its offset to the offset of the next +field. This allows an offset within a struct node to be perfectly resolved to a +single field. + +It is possible to nest structs, so that the type of the field within a struct +TBAA node is itself a struct TBAA node. + +Representation +-------------- + +Struct TBAA nodes are represented in metadata as tuples of 2N + 1 entries, where +N is the number of fields in the struct. + +The first entry is an ``MDString`` containing the name of the struct type. + +Subsequent entries comprise a list of tuples of two operands each, representing +data for each field. The first operand in each tuple is a ``MDNode`` indicating +the type of the field. The second operand in each tuple is a constant integer +indicating the offset within the struct of said field. + +Fields must be ordered in increasing offset. Additionally, the type of every +field must ultimately share the same TBAA root node. + +Consider the following C/C++ type layout (recall that in C/C++, ``char`` is +legally able to alias any type, whereas types are not otherwise generally +aliasable): + +.. code-block:: c + + struct Inner { + int a; + char b; + }; + + struct Outer { + double c; + struct Inner i; + }; + + union U { + Outer o; + float f; + }; + +For these types, the generated TBAA metadata could look as follows: + +.. code-block:: llvm + + !0 = !{!"TBAA root"} + !1 = !{!"char", !0} + !2 = !{!"int", !1} + !3 = !{!"double", !1} + !4 = !{!"float", !1} + !5 = !{!"Inner", !2, 0, !1, 4} + !6 = !{!"Outer", !3, 0, !5, 8} + ; No type data is generated for the union U. + ; Instead, U will be accessed only via !1 (which can alias anything). + +Access tags +=========== + +Access tags represent the actual metadata that is attached to LLVM instructions +using the ``!tbaa`` metadata type. They may be found on any instruction that +reads or writes memory: ``load``, ``store``, ``call`` or ``invoke`` (if they +are calling an intrinsic), ``cmpxchg``, ``atomicrmw``, and ``va_arg``. + +These tags consist of three components: an access type, a base type, and an +offset. The access type indicates the scalar TBAA type of the value being read +or written by the instruction. The base type represents the outermost struct +type containing the field, while the offset represents the offset of the field +within said struct. The base type may point to a scalar TBAA node instead of a +struct node, in which case the offset must be 0. + +The access type of an access tag is required to be a scalar TBAA node, not a +struct TBAA node. Furthermore, it must either be the type of the field at the +appropriate offset within the struct the base type (or the base type itself, +should it be a scalar TBAA node), recursing through inner structs as needed, or +some ancestor type of the resulting scalar type. + +In addition to these core compnents, there is an optional boolean flag +indicating whether or not the memory location this tag points to can ever be +changed (this corresponds to what the results of the ``pointsToConstantMemory`` +of ``AliasAnalysis`` should refer to, see +:doc:`alias analysis documentation ` for more details). If this +is not present, it is defaulted to ``false``. + +Representation +-------------- + +Access tags are stored as a ``MDNode`` with three or four operands. The first +operand is the ``MDNode`` pointing to the base type of the access tag. The +second operand is the ``MDNode`` pointing to the access type. The third operand +is a constant integer indicating the offset. The fourth operand, if present, +must be a constant integer of value ``0`` or ``1``, indicating whether the +location being accessed is constant, under the definition of +``pointsToConstantMemory``. If not present, the fourth operand defaults +to ``0``. + +Using the type definitions given above, and the following function: + +.. code-block:: c + + void f(struct Outer *o, struct Inner *i, U *u, double *d) { + outer->c = 0.0; + outer->i.a = 0; + outer->i.b = 0; + inner->b = 1; + *d = 0.0; + u->f = 0.0; + u->o->i.b = 2; + } + +The following LLVM IR could be generated (reusing the same TBAA type node +metadata from the previous example): + +.. code-block:: llvm + + type %inner = { i32, i8 } + type %outer = { double, %inner } + + define void @f(ptr %o, ptr %i, ptr %u, ptr %d) { + %o.c = getelementptr %outer, ptr %o, i32 0, i32 0 + store double 0.0, ptr %o.c, !tbaa !10 + %o.i.a = getelementptr %outer, ptr %o, i32 0, i32 1, i32 0 + store i32 0, ptr %o.i.a, !tbaa !11 + %o.i.b = getelementptr %outer, ptr %o, i32 0, i32 1, i32 1 + store i8 0, ptr %o.i.b, !tbaa !12 + %i.b = getelementptr %inner, ptr %i, i32 0, i32 1 + store i8 1, ptr %i.b, !tbaa !13 + store double 0.0, ptr %d, !tbaa !14 + store double 0.0, ptr %u, !tbaa !15 + %u.o.i.b = getelementptr %outer, ptr %u, i32 0, i32 1, i32 1 + store i8 2, ptr %u.o.i.b, !tbaa !16 + } + + !10 = !{!3, !6, i64 0} + !11 = !{!2, !6, i64 8} + !12 = !{!1, !6, i64 12} + !13 = !{!1, !5, i64 4} + !14 = !{!3, !3, i64 0} + !15 = !{!1, !1, i64 0} + !16 = !{!1, !1, i64 0} + +Computing alias information +=========================== + +This section describes a conceptual view of how TBAA infers the ``MayAlias`` or +``NoAlias`` result for a query. It does not describe the exact algorithm used in +the code, instead adding extra cases to the analysis to make it easier to +understand when two variables may or may not alias each other when TBAA is +involved. + +If the two access tags are not part of the same TBAA type tree, then return +``MayAlias``. This includes the case that one or the other access does not have +any ``!tbaa`` metadata attached to it. + +If the access type of one access tag is not the ancestor of the access type of +the other access tag, then return ``NoAlias``. + +If the base type of one or both access tags is a scalar type, return +``MayAlias``. + +At this point, we can now distinguish the basic rule in C/C++ that ``int`` and +``float`` variables may not alias each other, as well as the rule that ``char`` +may be used to access any type. The remaining steps are to resolve cases where +we can identify that accesses to different fields of the same struct type may +not alias each other. + +For a struct type, we can identify the field that contains an offset. We can +resolve a type and offset pair by returning said field type and the offset +adjusted by the field's offset. This can continue onto scalar types by returning +the parent type with the offset unchanged (it must be 0, in any case, for the +access tag definition to be legal). By iterating this process, we can produce a +sequence of access tags. + +If the sequence of access tags for one access tag generated contains the base +type of the access tag, return ``MayAlias`` if they share the same offset and +``NoAlias`` if they do not. + +If no result has been computed by this point, return ``NoAlias``. + +Detailed example +================ + +By example, consider the following C code. The access tags for the stores, as +well as the sequence of access tags generated for lookup, is elaborated: + +.. code-block:: c + + struct A { + int f1; + int f2; + }; + + struct B { + struct A a1; + int f3; + struct A a2; + }; + + void f(struct B* b, struct A* a, int *p, float *q) { + b->a1->f1 = 1; // access tag: (B, int, 0) + // sequence: (A, int, 0), (int, int, 0) + b->a1->f2 = 2; // access tag: (B, int, 4) + // sequence: (A, int, 4), (int, int, 0) + a->f2 = 3; // access tag: (A, int, 4) + // sequence: (int, int, 0) + b->f3 = 4; // access tag: (B, int, 8) + // sequence: (int, int, 0) + b->a2->f1 = 5; // access tag: (B, int, 12) + // sequence: (A, int, 0), (int, int, 0) + b->a2->f2 = 6; // access tag: (B, int, 16) + // sequence: (A, int, 4), (int, int, 0) + *p = 7; // access tag: (int, int, 0) + *q = 8.0; // access tag: (float, float, 0) + } + +In this example, the store to ``q`` can be determined not to alias any of the +other stores, because the access type is different from all of the other tags. +Meanwhile, the store to ``p`` may alias any of the other stores (except for +``q``), as it is a scalar access type and the access type is the same as the +other access types. + +The stores of 1, 2, 4, 5, 6 are all mutually non-aliasing, as they share the +same base type ``B``, but each access a distinct offset within that type. + +The store of 3 may alias 2 and 6, as it shares the offset of 4 when ``A`` +appears in the sequence of access tags. It may not alias 5, as 5 has a different +offset when ``A`` appears in its sequence of access tags. diff --git a/llvm/docs/tbaa-scalar-tree.svg b/llvm/docs/tbaa-scalar-tree.svg new file mode 100644 --- /dev/null +++ b/llvm/docs/tbaa-scalar-tree.svg @@ -0,0 +1,91 @@ + + + + + + +G + + + +TBAA root + +TBAA root + + + +A + +A + + + +A->TBAA root + + + + + +B + +B + + + +B->A + + + + + +C + +C + + + +C->A + + + + + +D + +D + + + +D->A + + + + + +E + +E + + + +E->C + + + + + +F + +F + + + +F->E + + + + +