Index: docs/AMDGPUUsage.rst =================================================================== --- docs/AMDGPUUsage.rst +++ docs/AMDGPUUsage.rst @@ -1,109 +1,3436 @@ -============================== -User Guide for AMDGPU Back-end -============================== +============================= +User Guide for AMDGPU Backend +============================= + +.. contents:: + :local: Introduction ============ -The AMDGPU back-end provides ISA code generation for AMD GPUs, starting with -the R600 family up until the current Volcanic Islands (GCN Gen 3). +The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the +R600 family up until the current GCN families. It lives in the +``lib/Target/AMDGPU`` directory. -Refer to `AMDGPU section in Architecture & Platform Information for Compiler Writers `_ -for additional documentation. +LLVM +==== -Conventions -=========== +.. _amdgpu-target-triples: + +Target Triples +-------------- + +Use the ``clang -target ---`` option to +specify the target triple: + + .. table:: AMDGPU Target Triples + :name: amdgpu-target-triples-table + + ============ ======== ========= =========== + Architecture Vendor OS Environment + ============ ======== ========= =========== + r600 amd + amdgcn amd + amdgcn amd amdhsa + amdgcn amd amdhsa opencl + amdgcn amd amdhsa amdgizcl + amdgcn amd amdhsa amdgiz + amdgcn amd amdhsa hcc + ============ ======== ========= =========== + +``r600-amd--`` + Supports AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders executed on + the MESA runtime. + +``amdgcn-amd--`` + Supports AMD GPUs GCN 6 onwards for graphics and compute shaders executed on + the MESA runtime. + +``amdgcn-amd-amdhsa-`` + Supports AMD GCN GPUs GFX6 onwards for compute kernels executed on HSA [HSA]_ + compatible runtimes such as AMD's ROCm [AMD-ROCm]_. + +``amdgcn-amd-amdhsa-opencl`` + Supports AMD GCN GPUs GFX6 onwards for OpenCL compute kernels executed on HSA + [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. 
See + :ref:`amdgpu-opencl`. + +``amdgcn-amd-amdhsa-amdgizcl`` + Same as ``amdgcn-amd-amdhsa-opencl`` except a different address space mapping + is used (see :ref:`amdgpu-address-spaces`). + +``amdgcn-amd-amdhsa-amdgiz`` + Same as ``amdgcn-amd-amdhsa-`` except a different address space mapping is + used (see :ref:`amdgpu-address-spaces`). + +``amdgcn-amd-amdhsa-hcc`` + Supports AMD GCN GPUs GFX6 onwards for AMD HC language compute kernels + executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. See + :ref:`amdgpu-hcc`. + +.. _amdgpu-processors: + +Processors +---------- + +Use the ``clang -mcpu `` option to specify the AMD GPU processor. The +names from both the *Processor* and *Alternative Processor* can be used. + + .. table:: AMDGPU Processors + :name: amdgpu-processors-table + + ========== =========== ============ ===== ======= ================== + Processor Alternative Target dGPU/ Runtime Example + Processor Triple APU Support Products + Architecture + ========== =========== ============ ===== ======= ================== + **R600** [AMD-R6xx]_ + -------------------------------------------------------------------- + r600 r600 dGPU + r630 r600 dGPU + rs880 r600 dGPU + rv670 r600 dGPU + **R700** [AMD-R7xx]_ + -------------------------------------------------------------------- + rv710 r600 dGPU + rv730 r600 dGPU + rv770 r600 dGPU + **Evergreen** [AMD-Evergreen]_ + -------------------------------------------------------------------- + cedar r600 dGPU + redwood r600 dGPU + sumo r600 dGPU + juniper r600 dGPU + cypress r600 dGPU + **Northern Islands** [AMD-Cayman-Trinity]_ + -------------------------------------------------------------------- + barts r600 dGPU + turks r600 dGPU + caicos r600 dGPU + cayman r600 dGPU + **GCN GFX6 (Southern Islands (SI))** [AMD-Souther-Islands]_ + -------------------------------------------------------------------- + gfx600 - SI amdgcn dGPU + - tahiti + gfx601 - pitcairn amdgcn dGPU + - verde + - oland + - hainan + 
**GCN GFX7 (Sea Islands (CI))** [AMD-Sea-Islands]_ + -------------------------------------------------------------------- + gfx700 - bonaire amdgcn dGPU - Radeon HD 7790 + - Radeon HD 8770 + - R7 260 + - R7 260X + \ - kaveri amdgcn APU - A6-7000 + - A6 Pro-7050B + - A8-7100 + - A8 Pro-7150B + - A10-7300 + - A10 Pro-7350B + - FX-7500 + - A8-7200P + - A10-7400P + - FX-7600P + gfx701 - hawaii amdgcn dGPU ROCm - FirePro W8100 + - FirePro W9100 + - FirePro S9150 + - FirePro S9170 + \ dGPU ROCm - Radeon R9 290 + - Radeon R9 290x + - Radeon R390 + - Radeonb R390x + gfx702 - kabini amdgcn APU - E1-2100 + - mullins - E1-2200 + - E1-2500 + - E2-3000 + - E2-3800 + - A4-5000 + - A4-5100 + - A6-5200 + - A4 Pro-3340B + **GCN GFX8 (Volcanic Islands (VI))** [AMD-Volcanic-Islands]_ + -------------------------------------------------------------------- + gfx800 - iceland amdgcn dGPU - FirePro S7150 + - FirePro S7100 + - FirePro W7100 + - Radeon R285 + - Radeon R9 380 + - Radeon R9 385 + - Mobile FirePro + M7170 + gfx801 - carrizo amdgcn APU - A6-8500P + - Pro A6-8500B + - A8-8600P + - Pro A8-8600B + - FX-8800P + - Pro A12-8800B + \ amdgcn APU ROCm - A10-8700P + - Pro A10-8700B + - A10-8780P + \ amdgcn APU - A10-9600P + - A10-9630P + - A12-9700P + - A12-9730P + - FX-9800P + - FX-9830P + \ amdgcn APU - E2-9010 + - A6-9210 + - A9-9410 + gfx802 - tonga amdgcn dGPU ROCm Same as gfx800 + gfx803 - fiji amdgcn dGPU ROCm - Radeon R9 Nano + - Radeon R9 Fury + - Radeon R9 FuryX + - Radeon Pro Duo + - FirePro S9300x2 + \ - polaris10 amdgcn dGPU ROCm - Radeon RX 470 + - Radeon RX 480 + \ - polaris11 amdgcn dGPU ROCm - Radeon RX 460 + gfx804 amdgcn dGPU Same as gfx803 + gfx810 - stoney amdgcn APU + **GCN GFX9** + -------------------------------------------------------------------- + gfx900 amdgcn dGPU - FirePro W9500 + - FirePro S9500 + - FirePro S9500x2 + gfx901 amdgcn dGPU ROCm Same as gfx900 + except XNACK is + enabled + gfx902 amdgcn APU *TBA* + + .. TODO + Add product + names. 
+ gfx903 amdgcn APU Same as gfx902 + except XNACK is + enabled + ========== =========== ============ ===== ======= ================== + +.. _amdgpu-address-spaces: Address Spaces -------------- -The AMDGPU back-end uses the following address space mapping: +The AMDGPU backend uses the following address space mappings. + +The memory space names used in the table, aside from the region memory space, are +from the OpenCL standard. + +The LLVM Address Space number is used throughout LLVM (for example, in LLVM IR). + + .. table:: Address Space Mapping + :name: amdgpu-address-space-mapping-table + + ================== ================= ================= ================= ================= + LLVM Address Space Memory Space + ------------------ ----------------------------------------------------------------------- + \ Current Default amdgiz/amdgizcl hcc Future Default + ================== ================= ================= ================= ================= + 0 Private (Scratch) Generic (Flat) Generic (Flat) Generic (Flat) + 1 Global Global Global Global + 2 Constant Constant Constant Region (GDS) + 3 Local (group/LDS) Local (group/LDS) Local (group/LDS) Local (group/LDS) + 4 Generic (Flat) Region (GDS) Region (GDS) Constant + 5 Region (GDS) Private (Scratch) Private (Scratch) Private (Scratch) + ================== ================= ================= ================= ================= + +Current Default + This is the current default address space mapping used for all languages + except hcc. This will shortly be deprecated. + +amdgiz/amdgizcl + This is the current address space mapping used when ``amdgiz`` or ``amdgizcl`` + is specified as the target triple environment value. + +hcc + This is the current address space mapping used when ``hcc`` is specified as + the target triple environment value. This will shortly be deprecated. + +Future Default + This will shortly be the only address space mapping for all languages using the + AMDGPU backend. + +.. 
_amdgpu-memory-scopes: + +Memory Scopes +------------- + +This section describes the LLVM memory synchronization scopes supported by the AMDGPU +backend memory model when the target triple OS is ``amdhsa`` (see +:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`). + +The memory model supported is based on the HSA memory model [HSA]_, which is +based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before +relation is transitive over the synchronizes-with relation independent of scope, +and synchronizes-with allows the memory scope instances to be inclusive (see +table :ref:`amdgpu-amdhsa-llvm-sync-scopes-amdhsa-table`). + +This is different from the OpenCL [OpenCL]_ memory model, which does not have scope +inclusion and requires the memory scopes to match exactly. However, this +is conservatively correct for OpenCL. + + .. table:: AMDHSA LLVM Sync Scopes for AMDHSA + :name: amdgpu-amdhsa-llvm-sync-scopes-amdhsa-table + + ================ ========================================================== + LLVM Sync Scope Description + ================ ========================================================== + *none* The default: ``system``. + + Synchronizes with, and participates in modification and + seq_cst total orderings with, other operations (except + image operations) for all address spaces (except private, + or generic that accesses private) provided the other + operation's sync scope is: + + - ``system``. + - ``agent`` and executed by a thread on the same agent. + - ``workgroup`` and executed by a thread in the same + workgroup. + - ``wavefront`` and executed by a thread in the same + wavefront. + + ``agent`` Synchronizes with, and participates in modification and + seq_cst total orderings with, other operations (except + image operations) for all address spaces (except private, + or generic that accesses private) provided the other + operation's sync scope is: + + - ``system`` or ``agent`` and executed by a thread on the + same agent. 
+ - ``workgroup`` and executed by a thread in the same + workgroup. + - ``wavefront`` and executed by a thread in the same + wavefront. + + ``workgroup`` Synchronizes with, and participates in modification and + seq_cst total orderings with, other operations (except + image operations) for all address spaces (except private, + or generic that accesses private) provided the other + operation's sync scope is: + + - ``system``, ``agent`` or ``workgroup`` and executed by a + thread in the same workgroup. + - ``wavefront`` and executed by a thread in the same + wavefront. + + ``wavefront`` Synchronizes with, and participates in modification and + seq_cst total orderings with, other operations (except + image operations) for all address spaces (except private, + or generic that accesses private) provided the other + operation's sync scope is: + + - ``system``, ``agent``, ``workgroup`` or ``wavefront`` + and executed by a thread in the same wavefront. + + ``singlethread`` Only synchronizes with, and participates in modification + and seq_cst total orderings with, other operations (except + image operations) running in the same thread for all + address spaces (for example, in signal handlers). + ================ ========================================================== + +AMDGPU Intrinsics +----------------- + +The AMDGPU backend implements the following intrinsics. + +*This section is WIP.* + +.. TODO + List AMDGPU intrinsics + +Code Object +=========== + +The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that +can be linked by ``lld`` to produce a standard ELF shared code object which can +be loaded and executed on an AMDGPU target. + +Header +------ + +The AMDGPU backend uses the following ELF header: + + .. 
table:: AMDGPU ELF Header + :name: amdgpu-elf-header-table + + ========================== ========================= + Field Value + ========================== ========================= + ``e_ident[EI_CLASS]`` ``ELFCLASS64`` + ``e_ident[EI_DATA]`` ``ELFDATA2LSB`` + ``e_ident[EI_OSABI]`` ``ELFOSABI_AMDGPU_HSA`` + ``e_ident[EI_ABIVERSION]`` ``ELFABIVERSION_AMDGPU_HSA`` + ``e_type`` ``ET_REL`` or ``ET_DYN`` + ``e_machine`` ``EM_AMDGPU`` + ``e_entry`` 0 + ``e_flags`` 0 + ========================== ========================= + +.. + + .. table:: AMDGPU ELF Header Enumeration Values + :name: amdgpu-elf-header-enumeration-values-table + + ============================ ===== + Name Value + ============================ ===== + ``EM_AMDGPU`` 224 + ``ELFOSABI_AMDGPU_HSA`` 64 + ``ELFABIVERSION_AMDGPU_HSA`` 1 + ============================ ===== + +``e_ident[EI_CLASS]`` + The ELF class is always ``ELFCLASS64``. The AMDGPU backend only supports 64-bit + applications. + +``e_ident[EI_DATA]`` + All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering. + +``e_ident[EI_OSABI]`` + The AMD GPU architecture-specific OS ABI of ``ELFOSABI_AMDGPU_HSA`` is used to + specify that the code object conforms to the AMD HSA runtime ABI [HSA]_. + +``e_ident[EI_ABIVERSION]`` + The AMD GPU architecture-specific OS ABI version of + ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of the AMD HSA runtime + ABI to which the code object conforms. + +``e_type`` + Can be one of the following values: + + + ``ET_REL`` + The type produced by the AMD GPU backend compiler, as it is a relocatable code + object. + + ``ET_DYN`` + The type produced by the linker, as it is a shared code object. + + The AMD HSA runtime loader requires an ``ET_DYN`` code object. + +``e_machine`` + The value ``EM_AMDGPU`` is used for the machine for all members of the AMD GPU + architecture family. The specific member is specified in the ``NT_AMD_AMDGPU_ISA`` + entry in the ``.note`` section (see :ref:`amdgpu-note-records`). 
+ +``e_entry`` + The entry point is 0 as the entry points for individual kernels must be + selected in order to invoke them through AQL packets. + +``e_flags`` + The value is 0 as no flags are used. + +Sections +-------- + +An AMDGPU target ELF code object has the standard ELF sections, which include: + + .. table:: AMDGPU ELF Sections + :name: amdgpu-elf-sections-table + + ================== ================ ================================= + Name Type Attributes + ================== ================ ================================= + ``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE`` + ``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE`` + ``.debug_``\ *\** ``SHT_PROGBITS`` *none* + ``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC`` + ``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE`` + ``.hash`` ``SHT_HASH`` ``SHF_ALLOC`` + ``.note`` ``SHT_NOTE`` *none* + ``.rela``\ *name* ``SHT_RELA`` *none* + ``.rela.dyn`` ``SHT_RELA`` *none* + ``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``.shstrtab`` ``SHT_STRTAB`` *none* + ``.strtab`` ``SHT_STRTAB`` *none* + ``.symtab`` ``SHT_SYMTAB`` *none* + ``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR`` + ================== ================ ================================= + +These sections have their standard meanings (see [ELF]_) and are only generated +if needed. + +``.debug_``\ *\** + The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the + DWARF produced by the AMDGPU backend. + +``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash`` + The standard sections used by a dynamic loader. + +``.note`` + See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU + backend. + +``.rela``\ *name*, ``.rela.dyn`` + For relocatable code objects, *name* is the name of the section to which the + relocation records apply. 
For example, ``.rela.text`` is the section name for + relocation records associated with the ``.text`` section. + + For linked shared code objects, ``.rela.dyn`` contains all the relocation + records from each of the relocatable code object's ``.rela``\ *name* sections. + + See :ref:`amdgpu-relocation-records` for the relocation records supported by + the AMDGPU backend. + +``.text`` + The executable machine code for the kernels and the functions they call. Generated + as position-independent code. See :ref:`amdgpu-code-conventions` for + information on conventions used in the ISA generation. + +.. _amdgpu-note-records: + +Note Records +------------ + +As required by ``ELFCLASS64``, minimal zero byte padding must be generated after +the ``name`` field to ensure the ``desc`` field is 4 byte aligned. In addition, +minimal zero byte padding must be generated to ensure the ``desc`` field size is +a multiple of 4 bytes. The ``sh_addralign`` field of the ``.note`` section must +be at least 4 to indicate at least 8 byte alignment. + +The AMDGPU backend code object uses the following ELF note records in the +``.note`` section. The *Description* column specifies the layout of the note +record's ``desc`` field. All fields are consecutive bytes. Note records with +variable-size strings have a corresponding ``*_size`` field that specifies the +number of bytes, including the terminating null character, in the string. The +string(s) come immediately after the preceding fields. + +Additional note records can be present. + + .. 
table:: AMDGPU ELF Note Records + :name: amdgpu-elf-note-records-table - ================== =================== ============== - LLVM Address Space DWARF Address Space Memory Space - ================== =================== ============== - 0 1 Private - 1 N/A Global - 2 N/A Constant - 3 2 Local - 4 N/A Generic (Flat) - 5 N/A Region - ================== =================== ============== + ===== ========================== ========================================== + Name Type Description + ===== ========================== ========================================== + "AMD" ``NT_AMD_AMDGPU_METADATA`` + "AMD" ``NT_AMD_AMDGPU_ISA`` + ===== ========================== ========================================== -The terminology in the table, aside from the region memory space, is from the -OpenCL standard. +.. -LLVM Address Space is used throughout LLVM (for example, in LLVM IR). DWARF -Address Space is emitted in DWARF, and is used by tools, such as debugger, -profiler and others. + .. table:: AMDGPU ELF Note Record Enumeration Values + :name: amdgpu-elf-note-record-enumeration-values-table + + ============================= ===== + Name Value + ============================= ===== + *reserved* 0-9 + ``NT_AMD_AMDGPU_METADATA`` 10 + ``NT_AMD_AMDGPU_ISA`` 11 + ============================= ===== + +``NT_AMD_AMDGPU_ISA`` + Specifies the instruction set architecture used by the machine code contained + in the code object. + + This note record is required for code objects containing machine code for + processors matching the ``amdgcn`` architecture in table + :ref:`amdgpu-processors`. + + The null terminated string has the following syntax: + + *architecture*\ ``-``\ *vendor*\ ``-``\ *os*\ ``-``\ *environment*\ ``-``\ *processor* + + where: + + *architecture* + The architecture from table :ref:`amdgpu-target-triples-table`. + + This is always ``amdgcn`` when the target triple OS is ``amdhsa`` (see + :ref:`amdgpu-target-triples`). 
+ + *vendor* + The vendor from table :ref:`amdgpu-target-triples-table`. + + For the AMDGPU backend this is always ``amd``. + + *os* + The OS from table :ref:`amdgpu-target-triples-table`. + + *environment* + An environment from table :ref:`amdgpu-target-triples-table`, or blank if + the environment has no effect on the execution of the code object. + + For the AMDGPU backend this is currently always blank. + + *processor* + The processor from table :ref:`amdgpu-processors-table`. + + For example: + + ``amdgcn-amd-amdhsa--gfx901`` + +``NT_AMD_AMDGPU_METADATA`` + Specifies extensible metadata associated with the code object. See + :ref:`amdgpu-code-object-metadata` for the syntax of the code object metadata + string. + + This note record is required and must contain the minimum information + necessary to support the ROCm kernel queries. For example, the segment sizes + needed in a dispatch packet. In addition, a high-level language runtime may + require other information to be included. For example, the AMD OpenCL runtime + records kernel argument information. + + .. TODO + Is the string null terminated? It probably should not be if YAML allows it to + contain null characters; otherwise it should be. + +.. _amdgpu-code-object-metadata: + +Code Object Metadata +-------------------- + +The code object metadata is specified by the ``NT_AMD_AMDGPU_METADATA`` note +record (see :ref:`amdgpu-note-records`). + +The metadata is specified as a YAML formatted string (see [YAML]_ and +:doc:`YamlIO`). + +The metadata is represented as a single YAML document comprised of the mapping +defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and +referenced tables. + +For boolean values, 0 is used for false and 1 for true. + +Additional information can be added to the mappings. To avoid conflicts, any +non-AMD key names should be prefixed by "*vendor-name*.". + + .. 
table:: AMDHSA Code Object Metadata Mapping + :name: amdgpu-amdhsa-code-object-metadata-mapping-table + + ========== ============== ========= ======================================= + String Key Value Type Required? Description + ========== ============== ========= ======================================= + "Version" sequence of Required - The first integer is the major + 2 integers version. Currently 1. + - The second integer is the minor + version. Currently 0. + "Printf" sequence of Each string is encoded information + strings about a printf function call. The + encoded information is organized as + fields separated by colon (':'): + + ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString`` + + where: + + ``ID`` + A 32 bit integer as a unique id for + each printf function call + + ``N`` + A 32 bit integer equal to the number + of arguments of printf function call + minus 1 + + ``S[i]`` (where i = 0, 1, ... , N-1) + 32 bit integers for the size in bytes + of the i-th FormatString argument of + the printf function call + + FormatString + The format string passed to the + printf function call. + "Kernels" sequence of Required Sequence of the mappings for each + mapping kernel in the code object. See + :ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table` + for the definition of the mapping. + ========== ============== ========= ======================================= + +.. + + .. table:: AMDHSA Code Object Kernel Metadata Mapping + :name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table + + ================= ============== ========= ================================ + String Key Value Type Required? Description + ================= ============== ========= ================================ + "Name" string Required Source name of the kernel. + "SymbolName" string Required Name of the kernel + descriptor ELF symbol. + "Language" string Source language of the kernel. 
+ Values include: + + - "OpenCL C" + - "OpenCL C++" + - "HCC" + - "OpenMP" + + "LanguageVersion" sequence of - The first integer is the major + 2 integers version. + - The second integer is the + minor version. + "Attrs" mapping Mapping of kernel attributes. + See + :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table` + for the mapping definition. + "Arguments" sequence of Sequence of mappings of the + mapping kernel arguments. See + :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table` + for the definition of the mapping. + "CodeProps" mapping Mapping of properties related to + the kernel code. See + :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table` + for the mapping definition. + "DebugProps" mapping Mapping of properties related to + the kernel debugging. See + :ref:`amdgpu-amdhsa-code-object-kernel-debug-properties-metadata-mapping-table` + for the mapping definition. + ================= ============== ========= ================================ + +.. + + .. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping + :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table + + =================== ============== ========= ============================== + String Key Value Type Required? Description + =================== ============== ========= ============================== + "ReqdWorkGroupSize" sequence of The dispatch work-group size + 3 integers X, Y, Z must correspond to the + specified values. + + Corresponds to the OpenCL + ``reqd_work_group_size`` + attribute. + "WorkGroupSizeHint" sequence of The dispatch work-group size + 3 integers X, Y, Z is likely to be the + specified values. + + Corresponds to the OpenCL + ``work_group_size_hint`` + attribute. + "VecTypeHint" string The name of a scalar or vector + type. + + Corresponds to the OpenCL + ``vec_type_hint`` attribute. + =================== ============== ========= ============================== + +.. + + .. 
table:: AMDHSA Code Object Kernel Argument Metadata Mapping + :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table + + ================= ============== ========= ================================ + String Key Value Type Required? Description + ================= ============== ========= ================================ + "Name" string Kernel argument name. + "TypeName" string Kernel argument type name. + "Size" integer Required Kernel argument size in bytes. + "Align" integer Required Kernel argument alignment in + bytes. Must be a power of two. + "ValueKind" string Required Kernel argument kind that + specifies how to set up the + corresponding argument. + Values include: + + "ByValue" + The argument is copied + directly into the kernarg. + + "GlobalBuffer" + A global address space pointer + to the buffer data is passed + in the kernarg. + + "DynamicSharedPointer" + A group address space pointer + to dynamically allocated LDS + is passed in the kernarg. + + "Sampler" + A global address space + pointer to a S# is passed in + the kernarg. + + "Image" + A global address space + pointer to a T# is passed in + the kernarg. + + "Pipe" + A global address space pointer + to an OpenCL pipe is passed in + the kernarg. + + "Queue" + A global address space pointer + to an OpenCL device enqueue + queue is passed in the + kernarg. + + "HiddenGlobalOffsetX" + The OpenCL grid dispatch + global offset for the X + dimension is passed in the + kernarg. + + "HiddenGlobalOffsetY" + The OpenCL grid dispatch + global offset for the Y + dimension is passed in the + kernarg. + + "HiddenGlobalOffsetZ" + The OpenCL grid dispatch + global offset for the Z + dimension is passed in the + kernarg. + + "HiddenNone" + An argument that is not used + by the kernel. Space needs to + be left for it, but it does + not need to be set up. + + "HiddenPrintfBuffer" + A global address space pointer + to the runtime printf buffer + is passed in kernarg. 
+ + "HiddenDefaultQueue" + A global address space pointer + to the OpenCL device enqueue + queue that should be used by + the kernel by default is + passed in the kernarg. + + "HiddenCompletionAction" + *TBD* + + .. TODO + Add description. + + "ValueType" string Required Kernel argument value type. Only + present if "ValueKind" is + "ByValue". For vector data + types, the value is for the + element type. Values include: + + - "Struct" + - "I8" + - "U8" + - "I16" + - "U16" + - "F16" + - "I32" + - "U32" + - "F32" + - "I64" + - "U64" + - "F64" + + .. TODO + How can it be determined if a + vector type, and what size + vector? + "PointeeAlign" integer Alignment in bytes of pointee + type for pointer type kernel + argument. Must be a power + of 2. Only present if + "ValueKind" is + "DynamicSharedPointer". + "AddrSpaceQual" string Kernel argument address space + qualifier. Only present if + "ValueKind" is "GlobalBuffer" or + "DynamicSharedPointer". Values + are: + + - "Private" + - "Global" + - "Constant" + - "Local" + - "Generic" + - "Region" + + .. TODO + Is GlobalBuffer only Global + or Constant? Is + DynamicSharedPointer always + Local? Can HCC allow Generic? + How can Private or Region + ever happen? + "AccQual" string Kernel argument access + qualifier. Only present if + "ValueKind" is "Image" or + "Pipe". Values + are: + + - "ReadOnly" + - "WriteOnly" + - "ReadWrite" + + .. TODO + Does this apply to + GlobalBuffer? + "ActualAcc" string The actual memory accesses + performed by the kernel on the + kernel argument. Only present if + "ValueKind" is "GlobalBuffer", + "Image", or "Pipe". This may be + more restrictive than indicated + by "AccQual" to reflect what the + kernel actual does. If not + present then the runtime must + assume what is implied by + "AccQual" and "IsConst". Values + are: + + - "ReadOnly" + - "WriteOnly" + - "ReadWrite" + + "IsConst" boolean Indicates if the kernel argument + is const qualified. Only present + if "ValueKind" is + "GlobalBuffer". 
+ + "IsRestrict" boolean Indicates if the kernel argument + is restrict qualified. Only + present if "ValueKind" is + "GlobalBuffer". + + "IsVolatile" boolean Indicates if the kernel argument + is volatile qualified. Only + present if "ValueKind" is + "GlobalBuffer". + + "IsPipe" boolean Indicates if the kernel argument + is pipe qualified. Only present + if "ValueKind" is "Pipe". + + .. TODO + Can GlobalBuffer be pipe + qualified? + ================= ============== ========= ================================ + +.. + + .. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping + :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table + + ============================ ============== ========= ===================== + String Key Value Type Required? Description + ============================ ============== ========= ===================== + "KernargSegmentSize" integer Required The size in bytes of + the kernarg segment + that holds the values + of the arguments to + the kernel. + "GroupSegmentFixedSize" integer Required The amount of group + segment memory + required by a + work-group in + bytes. This does not + include any + dynamically allocated + group segment memory + that may be added + when the kernel is + dispatched. + "PrivateSegmentFixedSize" integer Required The amount of fixed + private address space + memory required for a + work-item in + bytes. If + IsDynamicCallstack + is 1 then additional + space must be added + to this value for the + call stack. + "KernargSegmentAlign" integer Required The maximum byte + alignment of + arguments in the + kernarg segment. Must + be a power of 2. + "WavefrontSize" integer Required Wavefront size. Must + be a power of 2. + "NumSGPRs" integer Number of scalar + registers used by a + wavefront for + GFX6-GFX9. This + includes the special + SGPRs for VCC, Flat + Scratch (GFX7-GFX9) + and XNACK (for + GFX8-GFX9). It does + not include the 16 + SGPR added if a trap + handler is + enabled. 
It is not + rounded up to the + allocation + granularity. + "NumVGPRs" integer Number of vector + registers used by + each work-item for + GFX6-GFX9 + "MaxFlatWorkgroupSize" integer Maximum flat + work-group size + supported by the + kernel in work-items. + "IsDynamicCallStack" boolean Indicates if the + generated machine + code is using a + dynamically sized + call stack. + "IsXNACKEnabled" boolean Indicates if the + generated machine + code is capable of + supporting XNACK. + ============================ ============== ========= ===================== + +.. + + .. table:: AMDHSA Code Object Kernel Debug Properties Metadata Mapping + :name: amdgpu-amdhsa-code-object-kernel-debug-properties-metadata-mapping-table + + =================================== ============== ========= ============== + String Key Value Type Required? Description + =================================== ============== ========= ============== + "DebuggerABIVersion" string + "ReservedNumVGPRs" integer + "ReservedFirstVGPR" integer + "PrivateSegmentBufferSGPR" integer + "WavefrontPrivateSegmentOffsetSGPR" integer + =================================== ============== ========= ============== + +.. _amdgpu-symbols: + +Symbols +------- + +Symbols include the following: + + .. table:: AMDGPU ELF Symbols + :name: amdgpu-elf-symbols-table + + ===================== ============== ============= ================== + Name Type Section Description + ===================== ============== ============= ================== + *link-name* ``STT_OBJECT`` - ``.data`` Global variable + - ``.rodata`` + - ``.bss`` + *link-name*\ ``@kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor + *link-name* ``STT_FUNC`` - ``.text`` Kernel entry point + ===================== ============== ============= ================== + +Global variable + Global variables both used and defined by the compilation unit. 
+
+  If the symbol is defined in the compilation unit then it is allocated in the
+  appropriate section according to whether it has initialized data or is
+  read-only.
+
+  If the symbol is external then its section is ``STN_UNDEF`` and the loader
+  will resolve relocations using the definition provided by another code object
+  or explicitly defined by the runtime.
+
+  All global symbols, whether defined in the compilation unit or external, are
+  accessed by the machine code indirectly through a GOT table entry. This
+  allows them to be preemptable. The GOT table is only supported when the
+  target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
+
+  .. TODO
+     Add description of linked shared object symbols. Seems undefined symbols
+     are marked as STT_NOTYPE.
+
+Kernel descriptor
+  Every HSA kernel has an associated kernel descriptor. It is the address of
+  the kernel descriptor that is used in the AQL dispatch packet used to invoke
+  the kernel, not the kernel entry point. The layout of the HSA kernel
+  descriptor is defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
+
+Kernel entry point
+  Every HSA kernel also has a symbol for its machine code entry point.
+
+.. _amdgpu-relocation-records:
+
+Relocation Records
+------------------
+
+The AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
+relocatable fields are:
+
+``word32``
+  This specifies a 32-bit field occupying 4 bytes with arbitrary byte
+  alignment. These values use the same byte order as other word values in the
+  AMD GPU architecture.
+
+``word64``
+  This specifies a 64-bit field occupying 8 bytes with arbitrary byte
+  alignment. These values use the same byte order as other word values in the
+  AMD GPU architecture.
+
+The following notations are used for specifying relocation calculations:
+
+**A**
+  Represents the addend used to compute the value of the relocatable field. 
+ +**G** + Represents the offset into the global offset table at which the relocation + entry’s symbol will reside during execution. + +**GOT** + Represents the address of the global offset table. + +**P** + Represents the place (section offset for ``et_rel`` or address for ``et_dyn``) + of the storage unit being relocated (computed using ``r_offset``). + +**S** + Represents the value of the symbol whose index resides in the relocation + entry. + +The following relocation types are supported: + + .. table:: AMDGPU ELF Relocation Records + :name: amdgpu-elf-relocation-records-table + + ========================== ===== ========== ============================== + Relocation Type Value Field Calculation + ========================== ===== ========== ============================== + ``R_AMDGPU_NONE`` 0 *none* *none* + ``R_AMDGPU_ABS32_LO`` 1 ``word32`` (S + A) & 0xFFFFFFFF + ``R_AMDGPU_ABS32_HI`` 2 ``word32`` (S + A) >> 32 + ``R_AMDGPU_ABS64`` 3 ``word64`` S + A + ``R_AMDGPU_REL32`` 4 ``word32`` S + A - P + ``R_AMDGPU_REL64`` 5 ``word64`` S + A - P + ``R_AMDGPU_ABS32`` 6 ``word32`` S + A + ``R_AMDGPU_GOTPCREL`` 7 ``word32`` G + GOT + A - P + ``R_AMDGPU_GOTPCREL32_LO`` 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF + ``R_AMDGPU_GOTPCREL32_HI`` 9 ``word32`` (G + GOT + A - P) >> 32 + ``R_AMDGPU_REL32_LO`` 10 ``word32`` (S + A - P) & 0xFFFFFFFF + ``R_AMDGPU_REL32_HI`` 11 ``word32`` (S + A - P) >> 32 + ========================== ===== ========== ============================== + +.. _amdgpu-dwarf: + +DWARF +----- + +Standard DWARF [DWARF]_ Version 2 sections can be generated. These contain +information that maps the code object executable code and data to the source +language constructs. It can be used by tools such as debuggers and profilers. + +Address Space Mapping +~~~~~~~~~~~~~~~~~~~~~ + +The following address space mapping is used: + + .. 
table:: AMDGPU DWARF Address Space Mapping
+     :name: amdgpu-dwarf-address-space-mapping-table
+
+     =================== =================
+     DWARF Address Space Memory Space
+     =================== =================
+     1                   Private (Scratch)
+     2                   Local (group/LDS)
+     *omitted*           Global
+     *omitted*           Constant
+     *omitted*           Generic (Flat)
+     *not supported*     Region (GDS)
+     =================== =================
+
+See :ref:`amdgpu-address-spaces` for information on the memory space
+terminology used in the table.
+
+An ``address_class`` attribute is generated on pointer type DIEs to specify the
+DWARF address space of the value of the pointer when it is in the *private* or
+*local* address space. Otherwise the attribute is omitted.
+
+An ``XDEREF`` operation is generated in location list expressions for variables
+that are allocated in the *private* and *local* address spaces. Otherwise no
+``XDEREF`` operation is generated.
+
+Register Mapping
+~~~~~~~~~~~~~~~~
+
+*This section is WIP.*
+
+.. TODO
+   Define DWARF register enumeration.
+
+   If want to present a wavefront state then should expose vector registers as
+   64 wide (rather than per work-item view that LLVM uses). Either as separate
+   registers, or a 64x4 byte single register. In either case use a new LANE op
+   (akin to XDEREF) to select the current lane usage in a location
+   expression. This would also allow scalar register spilling to vector
+   register lanes to be expressed (currently no debug information is being
+   generated for spilling). If choose a wide single register approach then use
+   LANE in conjunction with PIECE operation to select the dword part of the
+   register for the current lane. If the separate register approach then use
+   LANE to select the register.
+
+Source Text
+~~~~~~~~~~~
+
+*This section is WIP.*
+
+.. TODO
+   DWARF extension to include runtime generated source text.
+
+.. 
_amdgpu-code-conventions:
+
+Code Conventions
+================
+
+AMDHSA
+------
+
+This section provides code conventions used when the target triple OS is
+``amdhsa`` (see :ref:`amdgpu-target-triples`).
+
+Kernel Dispatch
+~~~~~~~~~~~~~~~
+
+The HSA architected queuing language (AQL) defines a user space memory
+interface that can be used to control the dispatch of kernels, in an
+agent-independent way. An agent can have zero or more AQL queues created for
+it using the ROCm runtime, in which AQL packets (all of which are 64 bytes)
+can be placed. See the *HSA Platform System Architecture Specification* [HSA]_
+for the AQL queue mechanics and packet layouts.
+
+The packet processor of a kernel agent is responsible for detecting and
+dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs
+the packet processor is implemented by the hardware command processor (CP),
+asynchronous dispatch controller (ADC) and shader processor input controller
+(SPI).
+
+The ROCm runtime can be used to allocate an AQL queue object. It uses the
+kernel mode driver to initialize and register the AQL queue with CP.
+
+To dispatch a kernel the following actions are performed. This can occur in
+the CPU host program, or from an HSA kernel executing on a GPU.
+
+1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
+   executed is obtained.
+2. A pointer to the kernel descriptor (see
+   :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
+   obtained. It must be for a kernel that is contained in a code object that
+   was loaded by the ROCm runtime on the kernel agent with which the AQL queue
+   is associated.
+3. Space is allocated for the kernel arguments using the ROCm runtime
+   allocator for a memory region with the kernarg property for the kernel
+   agent that will execute the kernel. It must be at least 16 byte aligned.
+4. Kernel argument values are assigned to the kernel argument memory
+   allocation. 
The layout is defined in the *HSA Programmer’s Language Reference*
+   [HSA]_. For AMDGPU the kernel execution directly accesses the kernel
+   argument memory in the same way constant memory is accessed. (Note that the
+   HSA specification allows an implementation to copy the kernel argument
+   contents to another location that is accessed by the kernel.)
+5. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
+   API uses 64 bit atomic operations to reserve space in the AQL queue for the
+   packet. The packet must be set up, and the final write must use an atomic
+   store release to set the packet kind to ensure the packet contents are
+   visible to the kernel agent. AQL defines a doorbell signal mechanism to
+   notify the kernel agent that the AQL queue has been updated. These rules,
+   and the layout of the AQL queue and kernel dispatch packet, are defined in
+   the *HSA System Architecture Specification* [HSA]_.
+6. A kernel dispatch packet includes information about the actual dispatch,
+   such as grid and work-group size, together with information from the code
+   object about the kernel, such as segment sizes. The ROCm runtime queries on
+   the kernel symbol can be used to obtain the code object values which are
+   recorded in the :ref:`amdgpu-code-object-metadata`.
+7. CP executes micro-code and is responsible for detecting and setting up the
+   GPU to execute the wavefronts of a kernel dispatch.
+8. CP ensures that when a wavefront starts executing the kernel machine code,
+   the scalar general purpose registers (SGPR) and vector general purpose
+   registers (VGPR) are set up as required by the machine code. The required
+   setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
+   register state is defined in
+   :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
+9. 
The prolog of the kernel machine code (see + :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary + before continuing executing the machine code that corresponds to the kernel. +10. When the kernel dispatch has completed execution, CP signals the completion + signal specified in the kernel dispatch packet if not 0. + +.. _amdgpu-amdhsa-memory-spaces: + +Memory Spaces +~~~~~~~~~~~~~ + +The memory space properties are: + + .. table:: AMDHSA Memory Spaces + :name: amdgpu-amdhsa-memory-spaces-table + + ================= =========== ======== ======= ================== + Memory Space Name HSA Segment Hardware Address NULL Value + Name Name Size + ================= =========== ======== ======= ================== + Private private scratch 32 0x00000000 + Local group LDS 32 0xFFFFFFFF + Global global global 64 0x0000000000000000 + Constant constant *same as 64 0x0000000000000000 + global* + Generic flat flat 64 0x0000000000000000 + Region N/A GDS 32 *not implemented + for AMDHSA* + ================= =========== ======== ======= ================== + +The global and constant memory spaces both use global virtual addresses, which +are the same virtual address space used by the CPU. However, some virtual +addresses may only be accessible to the CPU, some only accessible by the GPU, +and some by both. + +Using the constant memory space indicates that the data will not change during +the execution of the kernel. This allows scalar read instructions to be +used. The vector and scalar L1 caches are invalidated of volatile data before +each kernel dispatch execution to allow constant memory to change values between +kernel dispatches. + +The local memory space uses the hardware Local Data Store (LDS) which is +automatically allocated when the hardware creates work-groups of wavefronts, and +freed when all the wavefronts of a work-group have terminated. The data store +(DS) instructions can be used to access it. 
+
+The private memory space uses the hardware scratch memory support. If the
+kernel uses scratch, then the hardware allocates memory that is accessed using
+wavefront lane dword (4 byte) interleaving. The mapping used from private
+address to physical address is:
+
+  ``wavefront-scratch-base +
+  (private-address * wavefront-size * 4) +
+  (wavefront-lane-id * 4)``
+
+There are different ways that the wavefront scratch base address is determined
+by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
+memory can be accessed in an interleaved manner using buffer instructions with
+the scratch buffer descriptor and per wave scratch offset, by the scratch
+instructions, or by flat instructions. If each lane of a wavefront accesses
+the same private address, the interleaving results in adjacent dwords being
+accessed and hence requires fewer cache lines to be fetched. Multi-dword
+access is not supported except by flat and scratch instructions in GFX9.
+
+The generic address space uses the hardware flat address support available in
+GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
+local apertures), that are outside the range of addressable global memory, to
+map from a flat address to a private or local address.
+
+FLAT instructions can take a flat address and access global, private (scratch)
+and group (LDS) memory depending on whether the address is within one of the
+aperture ranges. Flat access to scratch requires hardware aperture setup and
+setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
+access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register
+setup (see :ref:`amdgpu-amdhsa-m0`).
+
+To convert between a segment address and a flat address the base addresses of
+the apertures can be used. 
For GFX7-GFX8 these are available in the
+:ref:`amdgpu-amdhsa-hsa-aql-queue`, the address of which can be obtained with
+the Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
+For GFX9 the aperture base addresses are directly available as inline constant
+registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
+address mode the aperture sizes are 2^32 bytes and the bases are aligned to
+2^32, which makes it easier to convert from flat to segment or segment to
+flat.
+
+HSA Image and Samplers
+~~~~~~~~~~~~~~~~~~~~~~
+
+Image and sampler handles created by the ROCm runtime are 64 bit addresses of
+hardware 32 byte V# and 48 byte S# objects respectively. In order to support
+the HSA ``query_sampler`` operations two extra dwords are used to store the
+HSA BRIG enumeration values for the queries that are not trivially deducible
+from the S# representation.
+
+HSA Signals
+~~~~~~~~~~~
+
+Signal handles created by the ROCm runtime are 64 bit addresses of a structure
+allocated in memory accessible from both the CPU and GPU. The structure is
+defined by the ROCm runtime and subject to change between releases (see
+[AMD-ROCm-github]_).
+
+.. _amdgpu-amdhsa-hsa-aql-queue:
+
+HSA AQL Queue
+~~~~~~~~~~~~~
+
+The AQL queue structure is defined by the ROCm runtime and subject to change
+between releases (see [AMD-ROCm-github]_). For some processors it contains
+fields needed to implement certain language features such as the flat address
+aperture bases. It also contains fields used by CP such as managing the
+allocation of scratch memory.
+
+.. _amdgpu-amdhsa-kernel-descriptor:
+
+Kernel Descriptor
+~~~~~~~~~~~~~~~~~
+
+A kernel descriptor consists of the information needed by CP to initiate the
+execution of a kernel, including the entry point address of the machine code
+that implements the kernel. 
+ +Kernel Descriptor for GFX6-GFX9 ++++++++++++++++++++++++++++++++ + +CP microcode requires the Kernel descritor to be allocated on 64 byte alignment. + + .. table:: Kernel Descriptor for GFX6-GFX9 + :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table + + ======= ======= =============================== =========================== + Bits Size Field Name Description + ======= ======= =============================== =========================== + 31:0 4 bytes group_segment_fixed_size The amount of fixed local + address space memory + required for a work-group + in bytes. This does not + include any dynamically + allocated local address + space memory that may be + added when the kernel is + dispatched. + 63:32 4 bytes private_segment_fixed_size The amount of fixed + private address space + memory required for a + work-item in bytes. If + is_dynamic_callstack is 1 + then additional space must + be added to this value for + the call stack. + 95:64 4 bytes max_flat_workgroup_size Maximum flat work-group + size supported by the + kernel in work-items. + 96 1 bit is_dynamic_call_stack Indicates if the generated + machine code is using a + dynamically sized call + stack. + 97 1 bit is_xnack_enabled Indicates if the generated + machine code is capable of + suppoting XNACK. + 127:98 30 bits Reserved. Must be 0. + 191:128 8 bytes kernel_code_entry_byte_offset Byte offset (possibly + negative) from base + address of kernel + descriptor to kernel's + entry point instruction + which must be 256 byte + aligned. + 383:192 24 Reserved. Must be 0. + bytes + 415:384 4 bytes compute_pgm_rsrc1 Compute Shader (CS) + program settings used by + CP to set up + ``COMPUTE_PGM_RSRC1`` + configuration + register. See + :ref:`amdgpu-amdhsa-compute_pgm_rsrc1_t-gfx6-gfx9-table`. + 447:416 4 bytes compute_pgm_rsrc2 Compute Shader (CS) + program settings used by + CP to set up + ``COMPUTE_PGM_RSRC2`` + configuration + register. See + :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`. 
+ 448 1 bit enable_sgpr_private_segment Enable the setup of the + _buffer SGPR user data registers + (see + :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). + + The total number of SGPR + user data registers + requested must not exceed + 16 and match value in + ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``. + Any requests beyond 16 + will be ignored. + 449 1 bit enable_sgpr_dispatch_ptr *see above* + 450 1 bit enable_sgpr_queue_ptr *see above* + 451 1 bit enable_sgpr_kernarg_segment_ptr *see above* + 452 1 bit enable_sgpr_dispatch_id *see above* + 453 1 bit enable_sgpr_flat_scratch_init *see above* + 454 1 bit enable_sgpr_private_segment *see above* + _size + 455 1 bit enable_sgpr_grid_workgroup Not implemented in CP and + _count_X should always be 0. + 456 1 bit enable_sgpr_grid_workgroup Not implemented in CP and + _count_Y should always be 0. + 457 1 bit enable_sgpr_grid_workgroup Not implemented in CP and + _count_Z should always be 0. + 463:458 6 bits Reserved. Must be 0. + 511:464 4 Reserved. Must be 0. + bytes + 512 **Total size 64 bytes.** + ======= =================================================================== + +.. + + .. table:: compute_pgm_rsrc1 for GFX6-GFX9 + :name: amdgpu-amdhsa-compute_pgm_rsrc1_t-gfx6-gfx9-table + + ======= ======= =============================== =========================== + Bits Size Field Name Description + ======= ======= =============================== =========================== + 5:0 6 bits granulated_workitem_vgpr_count Number of vector registers + used by each work-item, + granularity is device + specific: + + GFX6-9 + roundup((max-vgpg + 1) + / 4) - 1 + + Used by CP to set up + ``COMPUTE_PGM_RSRC1.VGPRS``. 
+ 9:6 4 bits granulated_wavefront_sgpr_count Number of scalar registers + used by a wavefront, + granularity is device + specific: + + GFX6-8 + roundup((max-sgpg + 1) + / 8) - 1 + GFX9 + roundup((max-sgpg + 1) + / 16) - 1 + + Includes the special SGPRs + for VCC, Flat Scratch (for + GFX7 onwards) and XNACK + (for GFX8 onwards). It does + not include the 16 SGPR + added if a trap handler is + enabled. + + Used by CP to set up + ``COMPUTE_PGM_RSRC1.SGPRS``. + 11:10 2 bits priority Must be 0. + + Start executing wavefront + at the specified priority. + + CP is responsible for + filling in + ``COMPUTE_PGM_RSRC1.PRIORITY``. + 13:12 2 bits float_mode_round_32 Wavefront starts execution + with specified rounding + mode for single (32 + bit) floating point + precision floating point + operations. + + Floating point rounding + mode values are defined in + :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`. + + Used by CP to set up + ``COMPUTE_PGM_RSRC1.FLOAT_MODE``. + 15:14 2 bits float_mode_round_16_64 Wavefront starts execution + with specified rounding + denorm mode for half/double (16 + and 64 bit) floating point + precision floating point + operations. + + Floating point rounding + mode values are defined in + :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`. + + Used by CP to set up + ``COMPUTE_PGM_RSRC1.FLOAT_MODE``. + 17:16 2 bits float_mode_denorm_32 Wavefront starts execution + with specified denorm mode + for single (32 + bit) floating point + precision floating point + operations. + + Floating point denorm mode + values are defined in + :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`. + + Used by CP to set up + ``COMPUTE_PGM_RSRC1.FLOAT_MODE``. + 19:18 2 bits float_mode_denorm_16_64 Wavefront starts execution + with specified denorm mode + for half/double (16 + and 64 bit) floating point + precision floating point + operations. 
+ + Floating point denorm mode + values are defined in + :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`. + + Used by CP to set up + ``COMPUTE_PGM_RSRC1.FLOAT_MODE``. + 20 1 bit priv Must be 0. + + Start executing wavefront + in privilege trap handler + mode. + + CP is responsible for + filling in + ``COMPUTE_PGM_RSRC1.PRIV``. + 21 1 bit enable_dx10_clamp Wavefront starts execution + with DX10 clamp mode + enabled. Used by the vector + ALU to force DX-10 style + treatment of NaN's (when + set, clamp NaN to zero, + otherwise pass NaN + through). + + Used by CP to set up + ``COMPUTE_PGM_RSRC1.DX10_CLAMP``. + 22 1 bit debug_mode Must be 0. + + Start executing wavefront + in single step mode. + + CP is responsible for + filling in + ``COMPUTE_PGM_RSRC1.DEBUG_MODE``. + 23 1 bit enable_ieee_mode Wavefront starts execution + with IEEE mode + enabled. Floating point + opcodes that support + exception flag gathering + will quiet and propagate + signaling-NaN inputs per + IEEE 754-2008. Min_dx10 and + max_dx10 become IEEE + 754-2008 compliant due to + signaling-NaN propagation + and quieting. + + Used by CP to set up + ``COMPUTE_PGM_RSRC1.IEEE_MODE``. + 24 1 bit bulky Must be 0. + + Only one work-group allowed + to execute on a compute + unit. + + CP is responsible for + filling in + ``COMPUTE_PGM_RSRC1.BULKY``. + 25 1 bit cdbg_user Must be 0. + + Flag that can be used to + control debugging code. + + CP is responsible for + filling in + ``COMPUTE_PGM_RSRC1.CDBG_USER``. + 31:26 6 bits Reserved. Must be 0. + 32 **Total size 4 bytes** + ======= =================================================================== + +.. + + .. 
table:: compute_pgm_rsrc2 for GFX6-GFX9 + :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table + + ======= ======= =============================== =========================== + Bits Size Field Name Description + ======= ======= =============================== =========================== + 0 1 bit enable_sgpr_private_segment Enable the setup of the + _wave_offset SGPR wave scratch offset + system register (see + :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). + + Used by CP to set up + ``COMPUTE_PGM_RSRC2.SCRATCH_EN``. + 5:1 5 bits user_sgpr_count The total number of SGPR + user data registers + requested. This number must + match the number of user + data registers enabled. + + Used by CP to set up + ``COMPUTE_PGM_RSRC2.USER_SGPR``. + 6 1 bit enable_trap_handler Set to 1 if code contains a + TRAP instruction which + requires a trap hander to + be enabled. + + CP sets + ``COMPUTE_PGM_RSRC2.TRAP_PRESENT`` + if the runtime has + installed a trap handler + regardless of the setting + of this field. + 7 1 bit enable_sgpr_workgroup_id_x Enable the setup of the + system SGPR register for + the work-group id in the X + dimension (see + :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). + + Used by CP to set up + ``COMPUTE_PGM_RSRC2.TGID_X_EN``. + 8 1 bit enable_sgpr_workgroup_id_y Enable the setup of the + system SGPR register for + the work-group id in the Y + dimension (see + :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). + + Used by CP to set up + ``COMPUTE_PGM_RSRC2.TGID_Y_EN``. + 9 1 bit enable_sgpr_workgroup_id_z Enable the setup of the + system SGPR register for + the work-group id in the Z + dimension (see + :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). + + Used by CP to set up + ``COMPUTE_PGM_RSRC2.TGID_Z_EN``. + 10 1 bit enable_sgpr_workgroup_info Enable the setup of the + system SGPR register for + work-group information (see + :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). 
+ + Used by CP to set up + ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``. + 12:11 2 bits enable_vgpr_workitem_id Enable the setup of the + VGPR system registers used + for the work-item ID. + :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table` + defines the values. + + Used by CP to set up + ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``. + 13 1 bit enable_exception_address_watch Must be 0. + + Wavefront starts execution + with address watch + exceptions enabled which + are generated when L1 has + witnessed a thread access + an *address of + interest*. + + CP is responsible for + filling in the address + watch bit in + ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB`` + according to what the + runtime requests. + 14 1 bit enable_exception_memory Must be 0. + + Wavefront starts execution + with memory violation + exceptions exceptions + enabled which are generated + when a memory violation has + occurred for this wave from + L1 or LDS + (write-to-read-only-memory, + mis-aligned atomic, LDS + address out of range, + illegal address, etc.). + + CP sets the memory + violation bit in + ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB`` + according to what the + runtime requests. + 23:15 9 bits granulated_lds_size Must be 0. + + CP uses the rounded value + from the dispatch packet, + not this value, as the + dispatch may contain + dynamically allocated group + segment memory. CP writes + directly to + ``COMPUTE_PGM_RSRC2.LDS_SIZE``. + + Amount of group segment + (LDS) to allocate for each + work-group. Granularity is + device specific: + + GFX6: + roundup(lds-size / (64 * 4)) + GFX7-GFX9: + roundup(lds-size / (128 * 4)) + + 24 1 bit enable_exception_ieee_754_fp Wavefront starts execution + _invalid_operation with specified exceptions + enabled. + + Used by CP to set up + ``COMPUTE_PGM_RSRC2.EXCP_EN`` + (set from bits 0..6). 
+ + IEEE 754 FP Invalid + Operation + 25 1 bit enable_exception_fp_denormal FP Denormal one or more + _source input operands is a + denormal number + 26 1 bit enable_exception_ieee_754_fp IEEE 754 FP Division by + _division_by_zero Zero + 27 1 bit enable_exception_ieee_754_fp IEEE 754 FP FP Overflow + _overflow + 28 1 bit enable_exception_ieee_754_fp IEEE 754 FP Underflow + _underflow + 29 1 bit enable_exception_ieee_754_fp IEEE 754 FP Inexact + _inexact + 30 1 bit enable_exception_int_divide_by Integer Division by Zero + _zero (rcp_iflag_f32 instruction + only) + 31 1 bit Reserved. Must be 0. + 32 **Total size 4 bytes.** + ======= =================================================================== + +.. + + .. table:: Floating Point Rounding Mode Enumeration Values + :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table + + ===================================== ===== =============================== + Enumeration Name Value Description + ===================================== ===== =============================== + AMD_FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even + AMD_FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity + AMD_FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity + AMD_FLOAT_ROUND_MODE_ZERO 3 Round Toward 0 + ===================================== ===== =============================== + +.. + + .. 
table:: Floating Point Denorm Mode Enumeration Values
+     :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
+
+     ===================================== ===== ===============================
+     Enumeration Name                      Value Description
+     ===================================== ===== ===============================
+     AMD_FLOAT_DENORM_MODE_FLUSH_SRC_DST   0     Flush Source and Destination
+                                                 Denorms
+     AMD_FLOAT_DENORM_MODE_FLUSH_DST       1     Flush Output Denorms
+     AMD_FLOAT_DENORM_MODE_FLUSH_SRC       2     Flush Source Denorms
+     AMD_FLOAT_DENORM_MODE_FLUSH_NONE      3     No Flush
+     ===================================== ===== ===============================
+
+..
+
+  .. table:: System VGPR Work-Item ID Enumeration Values
+     :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
+
+     ===================================== ===== ===============================
+     Enumeration Name                      Value Description
+     ===================================== ===== ===============================
+     AMD_SYSTEM_VGPR_WORKITEM_ID_X         0     Set work-item X dimension ID.
+     AMD_SYSTEM_VGPR_WORKITEM_ID_X_Y       1     Set work-item X and Y
+                                                 dimensions ID.
+     AMD_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z     2     Set work-item X, Y and Z
+                                                 dimensions ID.
+     AMD_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3     Undefined.
+     ===================================== ===== ===============================
+
+.. _amdgpu-amdhsa-initial-kernel-execution-state:
+
+Initial Kernel Execution State
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This section defines the register state that will be set up by the packet
+processor prior to the start of execution of every wavefront. This is limited
+by the constraints of the hardware controllers of CP/ADC/SPI.
+
+The order of the SGPR registers is defined, but the compiler can specify which
+ones are actually set up in the kernel descriptor using the ``enable_sgpr_*``
+bit fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). 
The register numbers used +for enabled registers are dense starting at SGPR0: the first enabled register is +SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have +an SGPR number. + +The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to +all waves of the grid. It is possible to specify more than 16 User SGPRs using +the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually +initialized. These are then immediately followed by the System SGPRs that are +set up by ADC/SPI and can have different values for each wave of the grid +dispatch. + +SGPR register initial state is defined in +:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`. + + .. table:: SGPR Register Set Up Order + :name: amdgpu-amdhsa-sgpr-register-set-up-order-table + + ========== ========================== ====== ============================== + SGPR Order Name Number Description + (kernel descriptor enable of + field) SGPRs + ========== ========================== ====== ============================== + First Private Segment Buffer 4 V# that can be used, together + (enable_sgpr_private with Scratch Wave Offset as an + _segment_buffer) offset, to access the private + memory space using a segment + address. + + CP uses the value provided by + the runtime. + then Dispatch Ptr 2 64 bit address of AQL dispatch + (enable_sgpr_dispatch_ptr) packet for kernel dispatch + actually executing. + then Queue Ptr 2 64 bit address of amd_queue_t + (enable_sgpr_queue_ptr) object for AQL queue on which + the dispatch packet was + queued. + then Kernarg Segment Ptr 2 64 bit address of Kernarg + (enable_sgpr_kernarg segment. This is directly + _segment_ptr) copied from the + kernarg_address in the kernel + dispatch packet. + + Having CP load it once avoids + loading it at the beginning of + every wavefront. + then Dispatch Id 2 64 bit Dispatch ID of the + (enable_sgpr_dispatch_id) dispatch packet being + executed. 
+ then Flat Scratch Init 2 This is 2 SGPRs: + (enable_sgpr_flat_scratch + _init) GFX6 + Not supported. + GFX7-GFX8 + The first SGPR is a 32 bit + byte offset from + ``SH_HIDDEN_PRIVATE_BASE_VIMID`` + to per SPI base of memory + for scratch for the queue + executing the kernel + dispatch. CP obtains this + from the runtime. + + This is the same offset used + in computing the Scratch + Segment Buffer base + address. The value of + Scratch Wave Offset must be + added by the kernel machine + code and moved to SGPRn-4 + for use as the FLAT SCRATCH + BASE in flat memory + instructions. + + The second SGPR is 32 bit + byte size of a single + work-item’s scratch memory + usage. This is directly + loaded from the kernel + dispatch packet Private + Segment Byte Size and + rounded up to a multiple of + DWORD. + + The kernel code must move to + SGPRn-3 for use as the FLAT + SCRATCH SIZE in flat memory + instructions. Having CP load + it once avoids loading it at + the beginning of every + wavefront. + GFX9 + This is the 64 bit base + address of the per SPI + scratch backing memory + managed by SPI for the queue + executing the kernel + dispatch. CP obtains this + from the runtime (and + divides it if there are + multiple Shader Arrays each + with its own SPI). The value + of Scratch Wave Offset must + be added by the kernel + machine code and moved to + SGPRn-4 and SGPRn-3 for use + as the FLAT SCRATCH BASE in + flat memory instructions. + then Private Segment Size 1 The 32 bit byte size of a + (enable_sgpr_private single work-item’s scratch + _segment_size) memory allocation. This is the + value from the kernel dispatch + packet Private Segment Byte + Size rounded up by CP to a + multiple of DWORD. + + Having CP load it once avoids + loading it at the beginning of + every wavefront. + + This is not used for + GFX7-GFX8 since it is the same + value as the second SGPR of + Flat Scratch Init. 
However, it + may be needed for GFX9 which + changes the meaning of the + Flat Scratch Init value. + then Grid Work-Group Count X 1 32 bit count of the number of + (enable_sgpr_grid work-groups in the X dimension + _workgroup_count_X) for the grid being + executed. Computed from the + fields in the kernel dispatch + packet as ((grid_size.x + + workgroup_size.x - 1) / + workgroup_size.x). + then Grid Work-Group Count Y 1 32 bit count of the number of + (enable_sgpr_grid work-groups in the Y dimension + _workgroup_count_Y && for the grid being + less than 16 previous executed. Computed from the + SGPRs) fields in the kernel dispatch + packet as ((grid_size.y + + workgroup_size.y - 1) / + workgroupSize.y). + + Only initialized if <16 + previous SGPRs initialized. + then Grid Work-Group Count Z 1 32 bit count of the number of + (enable_sgpr_grid work-groups in the Z dimension + _workgroup_count_Z && for the grid being + less than 16 previous executed. Computed from the + SGPRs) fields in the kernel dispatch + packet as ((grid_size.z + + workgroup_size.z - 1) / + workgroupSize.z). + + Only initialized if <16 + previous SGPRs initialized. + then Work-Group Id X 1 32 bit work-group id in X + (enable_sgpr_workgroup_id dimension of grid for + _X) wavefront. + then Work-Group Id Y 1 32 bit work-group id in Y + (enable_sgpr_workgroup_id dimension of grid for + _Y) wavefront. + then Work-Group Id Z 1 32 bit work-group id in Z + (enable_sgpr_workgroup_id dimension of grid for + _Z) wavefront. + then Work-Group Info 1 {first_wave, 14’b0000, + (enable_sgpr_workgroup ordered_append_term[10:0], + _info) threadgroup_size_in_waves[5:0]} + then Scratch Wave Offset 1 32 bit byte offset from base + (enable_sgpr_private of scratch base of queue + _segment_wave_offset) executing the kernel + dispatch. Must be used as an + offset with Private + segment address when using + Scratch Segment Buffer. 
It
+                                                  must be used to set up FLAT
+                                                  SCRATCH for flat addressing
+                                                  (see
+                                                  :ref:`amdgpu-amdhsa-flat-scratch`).
+     ========== ========================== ====== ==============================
+
+The order of the VGPR registers is defined, but the compiler can specify which
+ones are actually set up in the kernel descriptor using the ``enable_vgpr*`` bit
+fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
+for enabled registers are dense starting at VGPR0: the first enabled register is
+VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
+VGPR number.
+
+VGPR register initial state is defined in
+:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
+
+  .. table:: VGPR Register Set Up Order
+     :name: amdgpu-amdhsa-vgpr-register-set-up-order-table
+
+     ========== ========================== ====== ==============================
+     VGPR Order Name                       Number Description
+                (kernel descriptor enable  of
+                field)                     VGPRs
+     ========== ========================== ====== ==============================
+     First      Work-Item Id X             1      32 bit work item id in X
+                (Always initialized)              dimension of work-group for
+                                                  wavefront lane.
+     then       Work-Item Id Y             1      32 bit work item id in Y
+                (enable_vgpr_workitem_id          dimension of work-group for
+                > 0)                              wavefront lane.
+     then       Work-Item Id Z             1      32 bit work item id in Z
+                (enable_vgpr_workitem_id          dimension of work-group for
+                > 1)                              wavefront lane.
+     ========== ========================== ====== ==============================
+
+The setting of registers is done by GPU CP/ADC/SPI hardware as follows:
+
+1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
+   registers.
+2. Work-group Id registers X, Y, Z are set by ADC, which supports any
+   combination including none.
+3. Scratch Wave Offset is set by SPI on a per-wave basis, which is why its
+   value cannot be included with the flat scratch init value, which is per
+   queue.
+4. 
The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
+   or (X, Y, Z).
+
+The Flat Scratch register pair consists of adjacent SGPRs so they can be moved
+as a 64 bit value to the hardware required SGPRn-3 and SGPRn-4 respectively.
+
+The global segment can be accessed either using buffer instructions (GFX6 which
+has V# 64 bit address support), flat instructions (GFX7-GFX9), or global
+instructions (GFX9).
+
+If buffer operations are used then the compiler can generate a V# with the
+following properties:
+
+* base address of 0
+* no swizzle
+* ATC: 1 if IOMMU present (such as APU)
+* ptr64: 1
+* MTYPE set to support memory coherence that matches the runtime (such as CC for
+  APU and NC for dGPU).
+
+.. _amdgpu-amdhsa-kernel-prolog:
+
+Kernel Prolog
+~~~~~~~~~~~~~
+
+.. _amdgpu-amdhsa-m0:
+
+M0
+++
+
+GFX6-GFX8
+  The M0 register must be initialized with a value at least the total LDS size
+  if the kernel may access LDS via DS or flat operations. The total LDS size is
+  available in the dispatch packet. For M0, it is also possible to use the
+  maximum possible value of LDS for the given target (0x7FFF for GFX6 and
+  0xFFFF for GFX7-GFX8).
+GFX9
+  The M0 register is not used for range checking LDS accesses and so does not
+  need to be initialized in the prolog.
+
+.. _amdgpu-amdhsa-flat-scratch:
+
+Flat Scratch
+++++++++++++
+
+If the kernel may use flat operations to access scratch memory, the prolog code
+must set up the FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI
+which are in SGPRn-4/SGPRn-3). Initialization uses the Flat Scratch Init and
+Scratch Wave Offset SGPR registers (see
+:ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
+
+GFX6
+  Flat scratch is not supported.
+
+GFX7-GFX8
+  1. The low word of Flat Scratch Init is a 32 bit byte offset from
+     ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
+     being managed by SPI for the queue executing the kernel dispatch. This is
+     the same value used in the Scratch Segment Buffer V# base address. The
+     prolog must add the value of Scratch Wave Offset to get the wave's byte
+     scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
+     FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
+     by 8 before moving into FLAT_SCRATCH_LO.
+  2. The second word of Flat Scratch Init is the 32 bit byte size of a single
+     work-item's scratch memory usage. This is directly loaded from the kernel
+     dispatch packet Private Segment Byte Size and rounded up to a multiple of
+     DWORD. Having CP load it once avoids loading it at the beginning of every
+     wavefront. The prolog must move it to FLAT_SCRATCH_HI for use as FLAT
+     SCRATCH SIZE.
+GFX9
+  The Flat Scratch Init is the 64 bit address of the base of scratch backing
+  memory being managed by SPI for the queue executing the kernel dispatch. The
+  prolog must add the value of Scratch Wave Offset to it and move the result to
+  the FLAT_SCRATCH pair for use as the flat scratch base in flat memory
+  instructions.
+
+.. _amdgpu-amdhsa-memory-model:
+
+Memory Model
+~~~~~~~~~~~~
+
+This section describes the mapping of the LLVM memory model onto AMDGPU machine
+code (see :ref:`memmodel`). *The implementation is WIP.*
+
+.. TODO
+   Update when implementation complete.
+
+   Support a more relaxed OpenCL memory model to be controlled by the
+   environment component of the target triple.
+
+The AMDGPU backend supports the memory synchronization scopes specified in
+:ref:`amdgpu-memory-scopes`.
+
+The code sequences used to implement the memory model are defined in table
+:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
+
+The sequences specify the order of instructions that a single thread must
+execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
+to other memory instructions executed by the same thread. This allows them to be
+moved earlier or later, which can allow them to be combined with other instances
+of the same instruction, or hoisted/sunk out of loops to improve
+performance. Only the instructions related to the memory model are given;
+additional ``s_waitcnt`` instructions are required to ensure registers are
+defined before being used. These may be able to be combined with the memory
+model ``s_waitcnt`` instructions as described above.
+
+The AMDGPU memory model supports both the HSA [HSA]_ memory model and the
+OpenCL [OpenCL]_ memory model. The HSA memory model uses a single happens-before
+relation for all address spaces (see :ref:`amdgpu-address-spaces`). The OpenCL
+memory model has separate happens-before relations for the global and local
+address spaces, and only a fence specifying both global and local address
+spaces joins the relationships. Since the LLVM ``fence`` instruction does not
+allow an address space to be specified, the OpenCL fence has to conservatively
+assume both local and global address spaces were specified. However,
+optimizations can often be done to eliminate the additional ``s_waitcnt``
+instructions when there are no intervening corresponding
+``ds/flat_load/store/atomic`` memory instructions. The code sequences in the
+table indicate what can be omitted for the OpenCL memory model. The target
+triple environment is used to determine if the source language is OpenCL (see
+:ref:`amdgpu-opencl`).
+
+``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
+operations.
+
+``buffer/global/flat_load/store/atomic`` instructions to global memory are
+termed vector memory operations.
+
+For GFX6-GFX9:
+
+* Each agent has multiple compute units (CU).
+* Each CU has multiple SIMDs that execute wavefronts.
+* The wavefronts for a single work-group are executed in the same CU but may be
+  executed by different SIMDs.
+* Each CU has a single LDS memory shared by the wavefronts of the work-groups
+  executing on it.
+* All LDS operations of a CU are performed as wavefront wide operations in a
+  global order and involve no caching. Completion is reported to a wavefront in
+  execution order.
+* The LDS memory has multiple request queues shared by the SIMDs of a
+  CU. Therefore, the LDS operations performed by different waves of a work-group
+  can be reordered relative to each other, which can result in reordering the
+  visibility of vector memory operations with respect to LDS operations of other
+  wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
+  ensure synchronization between LDS operations and vector memory operations
+  between waves of a work-group, but not between operations performed by the
+  same wavefront.
+* The vector memory operations are performed as wavefront wide operations and
+  completion is reported to a wavefront in execution order. The exception is
+  that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
+  vector memory order if they access LDS memory, and out of LDS operation order
+  if they access global memory.
+* The vector memory operations access a vector L1 cache shared by all wavefronts
+  on a CU. Therefore, no special action is required for coherence between
+  wavefronts in the same work-group. A ``buffer_wbinvl1_vol`` is required for
+  coherence between waves executing in different work-groups as they may be
+  executing on different CUs.
+* The scalar memory operations access a scalar L1 cache shared by all wavefronts
+  on a group of CUs. The scalar and vector L1 caches are not coherent. However,
+  scalar operations are used in a restricted way so do not impact the memory
+  model. See :ref:`amdgpu-amdhsa-memory-spaces`.
+* The vector and scalar memory operations use an L2 cache shared by all CUs on
+  the same agent.
+* The L2 cache has independent channels to service disjoint ranges of virtual
+  addresses.
+* Each CU has a separate request queue per channel. Therefore, the vector and
+  scalar memory operations performed by waves executing in different work-groups
+  (which may be executing on different CUs) of an agent can be reordered
+  relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
+  synchronization between vector memory operations of different CUs. It ensures
+  a previous vector memory operation has completed before executing a subsequent
+  vector memory or LDS operation and so can be used to meet the requirements of
+  acquire and release.
+* The L2 cache can be kept coherent with other agents on some targets, or ranges
+  of virtual addresses can be set up to bypass it to ensure system coherence.
+
+Private address space uses ``buffer_load/store`` using the scratch V#
+(GFX6-GFX8), or ``scratch_load/store`` (GFX9). Since only a single thread is
+accessing the memory, atomic memory orderings are not meaningful and all
+accesses are treated as non-atomic.
+
+Constant address space uses ``buffer/global_load`` instructions (or equivalent
+scalar memory instructions). Since the constant address space contents do not
+change during the execution of a kernel dispatch it is not legal to perform
+stores, atomic memory orderings are not meaningful, and all accesses are
+treated as non-atomic.
+
+A memory synchronization scope wider than work-group is not meaningful for the
+group (LDS) address space and is treated as work-group.
+
+The memory model does not support the region address space, which is treated as
+non-atomic.
+
+Acquire memory ordering is not meaningful on store atomic instructions and is
+treated as non-atomic.
+
+Release memory ordering is not meaningful on load atomic instructions and is
+treated as non-atomic.
+
+Acquire-release memory ordering is not meaningful on load or store atomic
+instructions and is treated as acquire and release respectively.
+
+The AMDGPU backend only uses scalar memory operations to access memory that is
+proven to not change during the execution of the kernel dispatch. This includes
+constant address space and global address space for program scope const
+variables. Therefore, the kernel machine code does not have to maintain the
+scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
+and vector L1 caches are invalidated between kernel dispatches by CP since
+constant address space data may change between kernel dispatch executions. See
+:ref:`amdgpu-amdhsa-memory-spaces`.
+
+The one exception is if scalar writes are used to spill SGPR registers. In this
+case the AMDGPU backend ensures the memory location used to spill is never
+accessed by vector memory operations at the same time. If scalar writes are used
+then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
+return, since the locations may be used for vector memory instructions by a
+future wave that uses the same scratch area, or a function call that creates a
+frame at the same address, respectively. There is no need for a ``s_dcache_inv``
+as all scalar writes are write-before-read in the same thread.
+
+Scratch backing memory (which is used for the private address space) is accessed
+with MTYPE NC_NV (non-coherent non-volatile). Since the private address space
+is only accessed by a single thread, and is always write-before-read,
+there is never a need to invalidate these entries from the L1 cache. Hence all
+cache invalidates are done as ``*_vol`` to only invalidate the volatile cache
+lines.
+
+On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
+to invalidate the L2 cache. This also causes it to be treated as non-volatile
+and so is not invalidated by ``*_vol``. On APU it is accessed as CC (cache
+coherent) and so the L2 cache will be coherent with the CPU and other agents.
+
+ .. 
table:: AMDHSA Memory Model Code Sequences GFX6-GFX9 + :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table + + ============ ============ ============== ========== ======================= + LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code + Ordering Sync Scope Address + Space + ============ ============ ============== ========== ======================= + **Non-Atomic** + --------------------------------------------------------------------------- + load *none* *none* - global non-volatile + - generic 1. buffer/global/flat_load + volatile + 1. buffer/global/flat_load + glc=1 + load *none* *none* - local 1. ds_load + store *none* *none* - global 1. buffer/global/flat_store + - generic + store *none* *none* - local 1. ds_store + **Unordered Atomic** + --------------------------------------------------------------------------- + load atomic unordered *any* *any* *Same as non-atomic*. + store atomic unordered *any* *any* *Same as non-atomic*. + atomicrmw unordered *any* *any* *Same as monotonic + atomic*. + **Monotonic Atomic** + --------------------------------------------------------------------------- + load atomic monotonic - singlethread - global 1. buffer/global/flat_load + - wavefront - generic + - workgroup + load atomic monotonic - singlethread - local 1. ds_load + - wavefront + - workgroup + load atomic monotonic - agent - global 1. buffer/global/flat_load + - system - generic glc=1 + store atomic monotonic - singlethread - global 1. buffer/global/flat_store + - wavefront - generic + - workgroup + - agent + - system + store atomic monotonic - singlethread - local 1. ds_store + - wavefront + - workgroup + atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic + - wavefront - generic + - workgroup + - agent + - system + atomicrmw monotonic - singlethread - local 1. 
ds_atomic + - wavefront + - workgroup + **Acquire Atomic** + --------------------------------------------------------------------------- + load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load + - wavefront - local + - generic + load atomic acquire - workgroup - global 1. buffer/global_load + load atomic acquire - workgroup - local 1. ds/flat_load + - generic 2. s_waitcnt lgkmcnt(0) + + - If OpenCL, omit + waitcnt. + - Must happen before + any following + global/generic + load/load + atomic/store/store + atomic/atomicrmw. + - Ensures any + following global + data read is no + older than the load + atomic value being + acquired. + + load atomic acquire - agent - global 1. buffer/global_load + - system glc=1 + 2. s_waitcnt vmcnt(0) + + - Must happen before + following + buffer_wbinvl1_vol. + - Ensures the load + has completed + before invalidating + the cache. + + 3. buffer_wbinvl1_vol + + - Must happen before + any following + global/generic + load/load + atomic/atomicrmw. + - Ensures that + following + loads will not see + stale global data. + + load atomic acquire - agent - generic 1. flat_load glc=1 + - system 2. s_waitcnt vmcnt(0) & + lgkmcnt(0) + + - If OpenCL omit + lgkmcnt(0). + - Must happen before + following + buffer_wbinvl1_vol. + - Ensures the flat_load + has completed + before invalidating + the cache. + + 3. buffer_wbinvl1_vol + + - Must happen before + any following + global/generic + load/load + atomic/atomicrmw. + - Ensures that + following loads + will not see stale + global data. + + atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic + - wavefront - local + - generic + atomicrmw acquire - workgroup - global 1. buffer/global_atomic + atomicrmw acquire - workgroup - local 1. ds/flat_atomic + - generic 2. waitcnt lgkmcnt(0) + + - If OpenCL, omit + waitcnt. + - Must happen before + any following + global/generic + load/load + atomic/store/store + atomic/atomicrmw. 
+ - Ensures any + following global + data read is no + older than the + atomicrmw value + being acquired. + + atomicrmw acquire - agent - global 1. buffer/global_atomic + - system 2. s_waitcnt vmcnt(0) + + - Must happen before + following + buffer_wbinvl1_vol. + - Ensures the + atomicrmw has + completed before + invalidating the + cache. + + 3. buffer_wbinvl1_vol + + - Must happen before + any following + global/generic + load/load + atomic/atomicrmw. + - Ensures that + following loads + will not see stale + global data. + + atomicrmw acquire - agent - generic 1. flat_atomic + - system 2. s_waitcnt vmcnt(0) & + lgkmcnt(0) + + - If OpenCL, omit + lgkmcnt(0). + - Must happen before + following + buffer_wbinvl1_vol. + - Ensures the + atomicrmw has + completed before + invalidating the + cache. + + 3. buffer_wbinvl1_vol + + - Must happen before + any following + global/generic + load/load + atomic/atomicrmw. + - Ensures that + following loads + will not see stale + global data. + + fence acquire - singlethread *none* *none* + - wavefront + fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0) + + - If OpenCL and + address space is + not generic, omit + waitcnt. However, + since LLVM + currently has no + address space on + the fence need to + conservatively + always generate. If + fence had an + address space then + set to address + space of OpenCL + fence flag, or to + generic if both + local and global + flags are + specified. + - Must happen after + any preceding + local/generic load + atomic/atomicrmw + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + fence-paired-atomic). + - Must happen before + any following + global/generic + load/load + atomic/store/store + atomic/atomicrmw. + - Ensures any + following global + data read is no + older than the + value read by the + fence-paired-atomic. + + fence acquire - agent *none* 1. 
s_waitcnt vmcnt(0) & + - system lgkmcnt(0) + + - If OpenCL and + address space is + not generic, omit + lgkmcnt(0). + However, since LLVM + currently has no + address space on + the fence need to + conservatively + always generate + (see comment for + previous fence). + - Could be split into + separate s_waitcnt + vmcnt(0) and + s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic load + atomic/atomicrmw + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + fence-paired-atomic). + - s_waitcnt lgkmcnt(0) + must happen after + any preceding + group/generic load + atomic/atomicrmw + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + fence-paired-atomic). + - Must happen before + the following + buffer_wbinvl1_vol. + - Ensures that the + fence-paired atomic + has completed + before invalidating + the + cache. Therefore + any following + locations read must + be no older than + the value read by + the + fence-paired-atomic. + + 2. buffer_wbinvl1_vol + + - Must happen before + any following global/generic + load/load + atomic/store/store + atomic/atomicrmw. + - Ensures that + following loads + will not see stale + global data. + + **Release Atomic** + --------------------------------------------------------------------------- + store atomic release - singlethread - global 1. buffer/global/ds/flat_store + - wavefront - local + - generic + store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0) + - generic + - If OpenCL, omit + waitcnt. + - Must happen after + any preceding + local/generic + load/store/load + atomic/store + atomic/atomicrmw. + - Must happen before + the following + store. + - Ensures that all + memory operations + to local have + completed before + performing the + store that is being + released. + + 2. 
buffer/global/flat_store + store atomic release - workgroup - local 1. ds_store + store atomic release - agent - global 1. s_waitcnt vmcnt(0) & + - system - generic lgkmcnt(0) + + - If OpenCL, omit + lgkmcnt(0). + - Could be split into + separate s_waitcnt + vmcnt(0) and + s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic + load/store/load + atomic/store + atomic/atomicrmw. + - s_waitcnt lgkmcnt(0) + must happen after + any preceding + local/generic + load/store/load + atomic/store + atomic/atomicrmw. + - Must happen before + the following + store. + - Ensures that all + memory operations + to global have + completed before + performing the + store that is being + released. + + 2. buffer/global/ds/flat_store + atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic + - wavefront - local + - generic + atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0) + - generic + - If OpenCL, omit + waitcnt. + - Must happen after + any preceding + local/generic + load/store/load + atomic/store + atomic/atomicrmw. + - Must happen before + the following + atomicrmw. + - Ensures that all + memory operations + to local have + completed before + performing the + atomicrmw that is + being released. + + 2. buffer/global/flat_atomic + atomicrmw release - workgroup - local 1. ds_atomic + atomicrmw release - agent - global 1. s_waitcnt vmcnt(0) & + - system - generic lgkmcnt(0) + + - If OpenCL, omit + lgkmcnt(0). + - Could be split into + separate s_waitcnt + vmcnt(0) and + s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic + load/store/load + atomic/store + atomic/atomicrmw. + - s_waitcnt lgkmcnt(0) + must happen after + any preceding + local/generic + load/store/load + atomic/store + atomic/atomicrmw. 
+ - Must happen before + the following + atomicrmw. + - Ensures that all + memory operations + to global and local + have completed + before performing + the atomicrmw that + is being released. + + 2. buffer/global/ds/flat_atomic* + fence release - singlethread *none* *none* + - wavefront + fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0) + + - If OpenCL and + address space is + not generic, omit + waitcnt. However, + since LLVM + currently has no + address space on + the fence need to + conservatively + always generate + (see comment for + previous fence). + - Must happen after + any preceding + local/generic + load/load + atomic/store/store + atomic/atomicrmw. + - Must happen before + any following store + atomic/atomicrmw + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + fence-paired-atomic). + - Ensures that all + memory operations + to local have + completed before + performing the + following + fence-paired-atomic. + + fence release - agent *none* 1. s_waitcnt vmcnt(0) & + - system lgkmcnt(0) + + - If OpenCL and + address space is + not generic, omit + lgkmcnt(0). + However, since LLVM + currently has no + address space on + the fence need to + conservatively + always generate + (see comment for + previous fence). + - Could be split into + separate s_waitcnt + vmcnt(0) and + s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic + load/store/load + atomic/store + atomic/atomicrmw. + - s_waitcnt lgkmcnt(0) + must happen after + any preceding + local/generic + load/store/load + atomic/store + atomic/atomicrmw. + - Must happen before + any following store + atomic/atomicrmw + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + fence-paired-atomic). 
+ - Ensures that all + memory operations + to global have + completed before + performing the + following + fence-paired-atomic. + + **Acquire-Release Atomic** + --------------------------------------------------------------------------- + atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic + - wavefront - local + - generic + atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0) + + - If OpenCL, omit + waitcnt. + - Must happen after + any preceding + local/generic + load/store/load + atomic/store + atomic/atomicrmw. + - Must happen before + the following + atomicrmw. + - Ensures that all + memory operations + to local have + completed before + performing the + atomicrmw that is + being released. + + 2. buffer/global_atomic + atomicrmw acq_rel - workgroup - local 1. ds_atomic + 2. s_waitcnt lgkmcnt(0) + + - If OpenCL, omit + waitcnt. + - Must happen before + any following + global/generic + load/load + atomic/store/store + atomic/atomicrmw. + - Ensures any + following global + data read is no + older than the load + atomic value being + acquired. + + atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0) + + - If OpenCL, omit + waitcnt. + - Must happen after + any preceding + local/generic + load/store/load + atomic/store + atomic/atomicrmw. + - Must happen before + the following + atomicrmw. + - Ensures that all + memory operations + to local have + completed before + performing the + atomicrmw that is + being released. + + 2. flat_atomic + 3. s_waitcnt lgkmcnt(0) + + - If OpenCL, omit + waitcnt. + - Must happen before + any following + global/generic + load/load + atomic/store/store + atomic/atomicrmw. + - Ensures any + following global + data read is no + older than the load + atomic value being + acquired. + atomicrmw acq_rel - agent - global 1. s_waitcnt vmcnt(0) & + - system lgkmcnt(0) + + - If OpenCL, omit + lgkmcnt(0). 
+ - Could be split into + separate s_waitcnt + vmcnt(0) and + s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic + load/store/load + atomic/store + atomic/atomicrmw. + - s_waitcnt lgkmcnt(0) + must happen after + any preceding + local/generic + load/store/load + atomic/store + atomic/atomicrmw. + - Must happen before + the following + atomicrmw. + - Ensures that all + memory operations + to global have + completed before + performing the + atomicrmw that is + being released. + + 2. buffer/global_atomic + 3. s_waitcnt vmcnt(0) + + - Must happen before + following + buffer_wbinvl1_vol. + - Ensures the + atomicrmw has + completed before + invalidating the + cache. + + 4. buffer_wbinvl1_vol + + - Must happen before + any following + global/generic + load/load + atomic/atomicrmw. + - Ensures that + following loads + will not see stale + global data. + + atomicrmw acq_rel - agent - generic 1. s_waitcnt vmcnt(0) & + - system lgkmcnt(0) + + - If OpenCL, omit + lgkmcnt(0). + - Could be split into + separate s_waitcnt + vmcnt(0) and + s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic + load/store/load + atomic/store + atomic/atomicrmw. + - s_waitcnt lgkmcnt(0) + must happen after + any preceding + local/generic + load/store/load + atomic/store + atomic/atomicrmw. + - Must happen before + the following + atomicrmw. + - Ensures that all + memory operations + to global have + completed before + performing the + atomicrmw that is + being released. + + 2. flat_atomic + 3. s_waitcnt vmcnt(0) & + lgkmcnt(0) + + - If OpenCL, omit + lgkmcnt(0). + - Must happen before + following + buffer_wbinvl1_vol. + - Ensures the + atomicrmw has + completed before + invalidating the + cache. + + 4. 
buffer_wbinvl1_vol + + - Must happen before + any following + global/generic + load/load + atomic/atomicrmw. + - Ensures that + following loads + will not see stale + global data. + + fence acq_rel - singlethread *none* *none* + - wavefront + fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0) + + - If OpenCL and + address space is + not generic, omit + waitcnt. However, + since LLVM + currently has no + address space on + the fence need to + conservatively + always generate + (see comment for + previous fence). + - Must happen after + any preceding + local/generic + load/load + atomic/store/store + atomic/atomicrmw. + - Must happen before + any following + global/generic + load/load + atomic/store/store + atomic/atomicrmw. + - Ensures that all + memory operations + to local have + completed before + performing any + following global + memory operations. + - Ensures that the + preceding + local/generic load + atomic/atomicrmw + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + fence-paired-atomic) + has completed + before following + global memory + operations. This + satisfies the + requirements of + acquire. + - Ensures that all + previous memory + operations have + completed before a + following + local/generic store + atomic/atomicrmw + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + fence-paired-atomic). + This satisfies the + requirements of + release. + + fence acq_rel - agent *none* 1. s_waitcnt vmcnt(0) & + - system lgkmcnt(0) + + - If OpenCL and + address space is + not generic, omit + lgkmcnt(0). + However, since LLVM + currently has no + address space on + the fence need to + conservatively + always generate + (see comment for + previous fence). + - Could be split into + separate s_waitcnt + vmcnt(0) and + s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. 
+ - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic + load/store/load + atomic/store + atomic/atomicrmw. + - s_waitcnt lgkmcnt(0) + must happen after + any preceding + local/generic + load/store/load + atomic/store + atomic/atomicrmw. + - Must happen before + the following + buffer_wbinvl1_vol. + - Ensures that the + preceding + global/local/generic + load + atomic/atomicrmw + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + fence-paired-atomic) + has completed + before invalidating + the cache. This + satisfies the + requirements of + acquire. + - Ensures that all + previous memory + operations have + completed before a + following + global/local/generic + store + atomic/atomicrmw + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + fence-paired-atomic). + This satisfies the + requirements of + release. + + 2. buffer_wbinvl1_vol + + - Must happen before + any following + global/generic + load/load + atomic/store/store + atomic/atomicrmw. + - Ensures that + following loads + will not see stale + global data. This + satisfies the + requirements of + acquire. + + **Sequential Consistent Atomic** + --------------------------------------------------------------------------- + load atomic seq_cst - singlethread - global *Same as corresponding + - wavefront - local load atomic acquire*. + - workgroup - generic + load atomic seq_cst - agent - global 1. s_waitcnt vmcnt(0) + - system - local + - generic - Must happen after + preceding + global/generic load + atomic/store + atomic/atomicrmw + with memory + ordering of seq_cst + and with equal or + wider sync scope. + (Note that seq_cst + fences have their + own s_waitcnt + vmcnt(0) and so do + not need to be + considered.) 
+                                                   - Ensures any
+                                                     preceding
+                                                     sequential
+                                                     consistent global
+                                                     memory instructions
+                                                     have completed
+                                                     before executing
+                                                     this sequentially
+                                                     consistent
+                                                     instruction. This
+                                                     prevents reordering
+                                                     a seq_cst store
+                                                     followed by a
+                                                     seq_cst load (Note
+                                                     that seq_cst is
+                                                     stronger than
+                                                     acquire/release as
+                                                     the reordering of
+                                                     load acquire
+                                                     followed by a store
+                                                     release is
+                                                     prevented by the
+                                                     waitcnt vmcnt(0) of
+                                                     the release, but
+                                                     there is nothing
+                                                     preventing a store
+                                                     release followed by
+                                                     load acquire from
+                                                     completing out of
+                                                     order.)
+
+                                                  2. *Following
+                                                     instructions same as
+                                                     corresponding load
+                                                     atomic acquire*.
+
+     store atomic seq_cst      - singlethread - global   *Same as corresponding
+                               - wavefront    - local    store atomic release*.
+                               - workgroup    - generic
+     store atomic seq_cst      - agent        - global   *Same as corresponding
+                               - system       - generic  store atomic release*.
+     atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
+                               - wavefront    - local    atomicrmw acq_rel*.
+                               - workgroup    - generic
+     atomicrmw    seq_cst      - agent        - global   *Same as corresponding
+                               - system       - generic  atomicrmw acq_rel*.
+     fence        seq_cst      - singlethread *none*     *Same as corresponding
+                               - wavefront               fence acq_rel*.
+                               - workgroup
+                               - agent
+                               - system
+     ============ ============ ============== ========== =======================
+
+The memory order also adds the single thread optimization constraints defined in
+table
+:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
+
+  .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
+     :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
+
+     ============ ==============================================================
+     LLVM Memory  Optimization Constraints
+     Ordering
+     ============ ==============================================================
+     unordered    *none*
+     monotonic    *none*
+     acquire      - If a load atomic/atomicrmw then no following load/load
+                    atomic/store/store atomic/atomicrmw/fence instruction can
+                    be moved before the acquire.
+                  - If a fence then same as load atomic, plus no preceding
+                    associated fence-paired-atomic can be moved after the fence.
+     release      - If a store atomic/atomicrmw then no preceding load/load
+                    atomic/store/store atomic/atomicrmw/fence instruction can
+                    be moved after the release.
+                  - If a fence then same as store atomic, plus no following
+                    associated fence-paired-atomic can be moved before the
+                    fence.
+     acq_rel      Same constraints as both acquire and release.
+     seq_cst      - If a load atomic then same constraints as acquire, plus no
+                    preceding sequentially consistent load atomic/store
+                    atomic/atomicrmw/fence instruction can be moved after the
+                    seq_cst.
+                  - If a store atomic then the same constraints as release,
+                    plus no following sequentially consistent load atomic/store
+                    atomic/atomicrmw/fence instruction can be moved before the
+                    seq_cst.
+                  - If an atomicrmw/fence then same constraints as acq_rel.
+     ============ ==============================================================
+
+Trap Handler ABI
+~~~~~~~~~~~~~~~~
+
+For code objects generated by the AMDGPU backend for HSA [HSA]_ compatible
+runtimes (such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that
+supports the ``s_trap`` instruction with the following usage:
+
+  ..
table:: AMDGPU Trap Handler for AMDHSA OS + :name: amdgpu-trap-handler-for-amdhsa-os-table + + =================== =============== =============== ======================= + Usage Code Sequence Trap Handler Description + Inputs + =================== =============== =============== ======================= + reserved ``s_trap 0x00`` Reserved by hardware. + ``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for HSA + ``queue_ptr`` ``debugtrap`` + ``VGPR0``: intrinsic (not + ``arg`` implemented). + ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes dispatch to be + ``queue_ptr`` terminated and its + associated queue put + into the error state. + ``llvm.debugtrap`` ``s_trap 0x03`` ``SGPR0-1``: If debugger not + ``queue_ptr`` installed handled + same as ``llvm.trap``. + debugger breakpoint ``s_trap 0x07`` Reserved for debugger + breakpoints. + debugger ``s_trap 0x08`` Reserved for debugger. + debugger ``s_trap 0xfe`` Reserved for debugger. + debugger ``s_trap 0xff`` Reserved for debugger. + =================== =============== =============== ======================= + +Non-AMDHSA +---------- Trap Handler ABI ----------------- -The OS element of the target triple controls the trap handler behavior. - -HSA OS -^^^^^^ -For code objects generated by AMDGPU back-end for the HSA OS, the runtime -installs a trap handler that supports the s_trap instruction with the following -usage: - - +--------------+-------------+-------------------+----------------------------+ - |Usage |Code Sequence|Trap Handler Inputs|Description | - +==============+=============+===================+============================+ - |reserved |s_trap 0x00 | |Reserved by hardware. 
| - +--------------+-------------+-------------------+----------------------------+ - |HSA debugtrap |s_trap 0x01 |SGPR0-1: queue_ptr |Reserved for HSA debugtrap | - |(arg) | |VGPR0: arg |intrinsic (not implemented).| - +--------------+-------------+-------------------+----------------------------+ - |llvm.trap |s_trap 0x02 |SGPR0-1: queue_ptr |Causes dispatch to be | - | | | |terminated and its | - | | | |associated queue put into | - | | | |the error state. | - +--------------+-------------+-------------------+----------------------------+ - |llvm.debugtrap| s_trap 0x03 |SGPR0-1: queue_ptr |If debugger not installed | - | | | |handled same as llvm.trap. | - +--------------+-------------+-------------------+----------------------------+ - |debugger |s_trap 0x07 | |Reserved for debugger | - |breakpoint | | |breakpoints. | - +--------------+-------------+-------------------+----------------------------+ - |debugger |s_trap 0x08 | |Reserved for debugger. | - +--------------+-------------+-------------------+----------------------------+ - |debugger |s_trap 0xfe | |Reserved for debugger. | - +--------------+-------------+-------------------+----------------------------+ - |debugger |s_trap 0xff | |Reserved for debugger. | - +--------------+-------------+-------------------+----------------------------+ - -Non-HSA OS -^^^^^^^^^^ -For code objects generated by AMDGPU back-end for non-HSA OS, the runtime does -not install a trap handler. The llvm.trap and llvm.debugtrap instructions are -handler as follows: - - =============== ============= =============================================== - Usage Code Sequence Description - =============== ============= =============================================== - llvm.trap s_endpgm Causes wavefront to be terminated. - llvm.debugtrap Nothing Compiler warning generated that there is no trap handler installed. 
- =============== ============= ===============================================
+~~~~~~~~~~~~~~~~
+
+For code objects generated by the AMDGPU backend for a non-``amdhsa`` OS, the
+runtime does not install a trap handler. The ``llvm.trap`` and
+``llvm.debugtrap`` instructions are handled as follows:
+
+  .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
+     :name: amdgpu-trap-handler-for-non-amdhsa-os-table
+
+     =============== =============== ===========================================
+     Usage           Code Sequence   Description
+     =============== =============== ===========================================
+     llvm.trap       s_endpgm        Causes wavefront to be terminated.
+     llvm.debugtrap  *none*          Compiler warning given that there is no
+                                     trap handler installed.
+     =============== =============== ===========================================
+
+Source Languages
+================
+
+.. _amdgpu-opencl:
+
+OpenCL
+------
+
+When generating code for the OpenCL language, the target triple environment
+should be ``opencl`` or ``amdgizcl`` (see :ref:`amdgpu-target-triples`).
+
+When the language is OpenCL the following differences occur:
+
+1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
+2. The AMDGPU backend adds additional arguments to the kernel.
+3. Additional metadata is generated (:ref:`amdgpu-code-object-metadata`).
+
+.. TODO
+   Specify what effect this has. Hidden arguments added. Additional metadata
+   generated.
+
+.. _amdgpu-hcc:
+
+HCC
+---
+
+When generating code for the HCC language, the target triple environment should
+be ``hcc`` (see :ref:`amdgpu-target-triples`).
+
+When the language is HCC the following differences occur:
+
+1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
+
+.. TODO
+   Specify what effect this has.
+
 Assembler
-=========
+---------
 
 AMDGPU backend has LLVM-MC based assembler which is currently in development.
-It supports Southern Islands ISA, Sea Islands and Volcanic Islands.
+It supports AMDGCN GFX6-GFX8.
-This document describes general syntax for instructions and operands. For more
-information about instructions, their semantics and supported combinations
-of operands, refer to one of Instruction Set Architecture manuals.
+This section describes general syntax for instructions and operands. For more
+information about instructions, their semantics and supported combinations of
+operands, refer to one of the instruction set architecture manuals
+[AMD-Souther-Islands]_ [AMD-Sea-Islands]_ [AMD-Volcanic-Islands]_.
 
-An instruction has the following syntax (register operands are
-normally comma-separated while extra operands are space-separated):
+An instruction has the following syntax (register operands are normally
+comma-separated while extra operands are space-separated):
 
 *<opcode> <operand0>, <operand1>,... <modifier0> <modifier1>...*
 
 Operands
---------
+~~~~~~~~
 
 The following syntax for register operands is supported:
@@ -140,8 +3467,11 @@
   - dst_unused (UNUSED_PAD, UNUSED_SEXT, UNUSED_PRESERVE)
   - abs, neg, sext
 
-DS Instructions Examples
-------------------------
+Instruction Examples
+~~~~~~~~~~~~~~~~~~~~
+
+DS
+++
 
 .. code-block:: nasm
 
@@ -153,8 +3483,8 @@
 For full list of supported instructions, refer to "LDS/GDS instructions" in ISA
 Manual.
 
-FLAT Instruction Examples
--------------------------
+FLAT
+++++
 
 .. code-block:: nasm
 
@@ -166,8 +3496,8 @@
 For full list of supported instructions, refer to "FLAT instructions" in ISA
 Manual.
 
-MUBUF Instruction Examples
---------------------------
+MUBUF
++++++
 
 .. code-block:: nasm
 
@@ -179,8 +3509,8 @@
 For full list of supported instructions, refer to "MUBUF Instructions" in ISA
 Manual.
 
-SMRD/SMEM Instruction Examples
-------------------------------
+SMRD/SMEM
++++++++++
 
 .. code-block:: nasm
 
@@ -192,8 +3522,8 @@
 For full list of supported instructions, refer to "Scalar Memory Operations" in
 ISA Manual.
 
-SOP1 Instruction Examples
--------------------------
+SOP1
+++++
 
 ..
code-block:: nasm @@ -207,8 +3537,8 @@ For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual. -SOP2 Instruction Examples -------------------------- +SOP2 +++++ .. code-block:: nasm @@ -224,8 +3554,8 @@ For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual. -SOPC Instruction Examples --------------------------- +SOPC +++++ .. code-block:: nasm @@ -236,8 +3566,8 @@ For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual. -SOPP Instruction Examples --------------------------- +SOPP +++++ .. code-block:: nasm @@ -259,8 +3589,8 @@ of SOPP Instructions, so it is up to the programmer to be familiar with the range or acceptable values. -Vector ALU Instruction Examples -------------------------------- +VALU +++++ For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA), the assembler will automatically use optimal encoding based on its operands. @@ -314,19 +3644,20 @@ For full list of supported instructions, refer to "Vector ALU instructions". HSA Code Object Directives --------------------------- +~~~~~~~~~~~~~~~~~~~~~~~~~~ AMDGPU ABI defines auxiliary data in output code object. In assembly source, one can specify them with assembler directives. .hsa_code_object_version major, minor -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ++++++++++++++++++++++++++++++++++++++ *major* and *minor* are integers that specify the version of the HSA code object that will be generated by the assembler. .hsa_code_object_isa [major, minor, stepping, vendor, arch] -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ + *major*, *minor*, and *stepping* are all integers that describe the instruction set architecture (ISA) version of the assembly program. @@ -338,13 +3669,13 @@ from the value of the -mcpu option that is passed to the assembler. 
.amdgpu_hsa_kernel (name) -^^^^^^^^^^^^^^^^^^^^^^^^^ ++++++++++++++++++++++++++ This directives specifies that the symbol with given name is a kernel entry point (label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL. .amd_kernel_code_t -^^^^^^^^^^^^^^^^^^ +++++++++++++++++++ This directive marks the beginning of a list of key / value pairs that are used to specify the amd_kernel_code_t object that will be emitted by the assembler. @@ -403,3 +3734,25 @@ s_endpgm .Lfunc_end0: .size hello_world, .Lfunc_end0-hello_world + +Additional Documentation +======================== + +.. [AMD-R6xx] `AMD R6xx shader ISA `__ +.. [AMD-R7xx] `AMD R7xx shader ISA `__ +.. [AMD-Evergreen] `AMD Evergreen shader ISA `__ +.. [AMD-Cayman-Trinity] `AMD Cayman/Trinity shader ISA `__ +.. [AMD-Souther-Islands] `AMD Southern Islands Series ISA `__ +.. [AMD-Sea-Islands] `AMD Sea Islands Series ISA `_ +.. [AMD-Volcanic-Islands] `AMD GCN3 Instruction Set Architecture `__ +.. [AMD-OpenCL_Programming-Guide] `AMD Accelerated Parallel Processing OpenCL Programming Guide `_ +.. [AMD-APP-SDK] `AMD Accelerated Parallel Processing APP SDK Documentation `__ +.. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing `__ +.. [AMD-ROCm-github] `ROCm github `__ +.. [HSA] `Heterogeneous System Architecture (HSA) Foundation `__ +.. [ELF] `Executable and Linkable Format (ELF) `__ +.. [DWARF] `DWARF Debugging Information Format `__ +.. [YAML] `YAML Ain’t Markup Language (YAML™) Version 1.2 `__ +.. [OpenCL] `The OpenCL Specification Version 2.0 `__ +.. [HRF] `Heterogeneous-race-free Memory Models `__ +.. 
[AMD-AMDGPU-Compute-Application-Binary-Interface] `AMDGPU Compute Application Binary Interface `__ Index: docs/CodeGenerator.rst =================================================================== --- docs/CodeGenerator.rst +++ docs/CodeGenerator.rst @@ -2642,59 +2642,6 @@ The AMDGPU backend ------------------ -The AMDGPU code generator lives in the lib/Target/AMDGPU directory, and is an -open source native AMD GCN ISA code generator. - -Target triples supported -^^^^^^^^^^^^^^^^^^^^^^^^ - -The following are the known target triples that are supported by the AMDGPU -backend. - -* **amdgcn--** --- AMD GCN GPUs (AMDGPU.7.0.0+) -* **amdgcn--amdhsa** --- AMD GCN GPUs (AMDGPU.7.0.0+) with HSA support -* **r600--** --- AMD GPUs HD2XXX-HD6XXX - -Relocations -^^^^^^^^^^^ - -Supported relocatable fields are: - -* **word32** --- This specifies a 32-bit field occupying 4 bytes with arbitrary - byte alignment. These values use the same byte order as other word values in - the AMD GPU architecture -* **word64** --- This specifies a 64-bit field occupying 8 bytes with arbitrary - byte alignment. These values use the same byte order as other word values in - the AMD GPU architecture - -Following notations are used for specifying relocation calculations: - -* **A** --- Represents the addend used to compute the value of the relocatable - field -* **G** --- Represents the offset into the global offset table at which the - relocation entry’s symbol will reside during execution. -* **GOT** --- Represents the address of the global offset table. 
-* **P** --- Represents the place (section offset or address) of the storage unit - being relocated (computed using ``r_offset``) -* **S** --- Represents the value of the symbol whose index resides in the - relocation entry - -AMDGPU Backend generates *Elf64_Rela* relocation records with the following -supported relocation types: - - ========================== ===== ========== ============================== - Relocation type Value Field Calculation - ========================== ===== ========== ============================== - ``R_AMDGPU_NONE`` 0 ``none`` ``none`` - ``R_AMDGPU_ABS32_LO`` 1 ``word32`` (S + A) & 0xFFFFFFFF - ``R_AMDGPU_ABS32_HI`` 2 ``word32`` (S + A) >> 32 - ``R_AMDGPU_ABS64`` 3 ``word64`` S + A - ``R_AMDGPU_REL32`` 4 ``word32`` S + A - P - ``R_AMDGPU_REL64`` 5 ``word64`` S + A - P - ``R_AMDGPU_ABS32`` 6 ``word32`` S + A - ``R_AMDGPU_GOTPCREL`` 7 ``word32`` G + GOT + A - P - ``R_AMDGPU_GOTPCREL32_LO`` 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF - ``R_AMDGPU_GOTPCREL32_HI`` 9 ``word32`` (G + GOT + A - P) >> 32 - ``R_AMDGPU_REL32_LO`` 10 ``word32`` (S + A - P) & 0xFFFFFFFF - ``R_AMDGPU_REL32_HI`` 11 ``word32`` (S + A - P) >> 32 - ========================== ===== ========== ============================== +The AMDGPU code generator lives in the ``lib/Target/AMDGPU`` +directory. This code generator is capable of targeting a variety of +AMD GPU processors. Refer to :doc:`AMDGPUUsage` for more information. 
Index: docs/CompilerWriterInfo.rst =================================================================== --- docs/CompilerWriterInfo.rst +++ docs/CompilerWriterInfo.rst @@ -72,16 +72,7 @@ AMDGPU ------ -* `AMD R6xx shader ISA `_ -* `AMD R7xx shader ISA `_ -* `AMD Evergreen shader ISA `_ -* `AMD Cayman/Trinity shader ISA `_ -* `AMD Southern Islands Series ISA `_ -* `AMD Sea Islands Series ISA `_ -* `AMD GCN3 Instruction Set Architecture `__ -* `AMD GPU Programming Guide `_ -* `AMD Compute Resources `_ -* `AMDGPU Compute Application Binary Interface `__ +Refer to :doc:`AMDGPUUsage` for additional documentation. RISC-V ------ Index: docs/index.rst =================================================================== --- docs/index.rst +++ docs/index.rst @@ -360,10 +360,10 @@ Answers some questions about the new Attributes infrastructure. :doc:`NVPTXUsage` - This document describes using the NVPTX back-end to compile GPU kernels. + This document describes using the NVPTX backend to compile GPU kernels. :doc:`AMDGPUUsage` - This document describes how to use the AMDGPU back-end. + This document describes using the AMDGPU backend to compile GPU kernels. :doc:`StackMaps` LLVM support for mapping instruction addresses to the location of