Index: clang/docs/ClangOffloadBundlerFileFormat.rst =================================================================== --- /dev/null +++ clang/docs/ClangOffloadBundlerFileFormat.rst @@ -0,0 +1,89 @@ +Bundled Device Binaries +======================= + +The ``clang-offload-bundler`` tool can be used to combine multiple device +binaries into a single bundled device binary file. The bundled device binary +entries are identified by a bundle entry ID which is defined by the +following EBNF syntax: + +.. code:: + + ::= "-" [ "-" ] + +Where: + +**offload_kind** + The runtime responsible for managing the loading of the code object. + See :ref:`offload-kind-table`. + +**target_triple** + The target triple of the device binary. + +**processor_or_target_id** + If target ID is not supported by the target, this is the processor for which the device + binary is compiled. If target ID is supported by the target, this is the target ID for which the + device binary is compiled (for AMD GPU see `Target ID Definition `_). + + .. table:: Bundled Device Binary Offload Kind + :name: offload-kind-table + + ============= ============================================================== + Offload Kind Description + ============= ============================================================== + host This offload kind is used for the first dummy empty entry + in the header of the bundle, which is required by + clang-offload-bundler, but is not used by language runtimes. + + hip Device binary loading is managed by the HIP language runtime. + + openmp Device binary loading is managed by the OpenMP language runtime. + ============= ============================================================== + +The format of a bundled device binary is defined by the following table: + + .. table:: Bundled Device Binary Memory Layout + :name: bundled-device-binary-fields-table + + ========================= ======== ========================== =============================== + Field Type Size in Bytes Description + ========================= ======== ========================== =============================== + Magic String string 24 ``__CLANG_OFFLOAD_BUNDLE__`` + + Number Of Device Binaries integer 8 Denoted as N in this table + + Entry Offset 1 integer 8 Byte offset from beginning of + bundled device binary to 1st device + binary. + + Entry Size 1 integer 8 Byte size of 1st code object. + + Entry ID Length 1 integer 8 Bundle entry ID character length + of 1st device binary + + Entry ID 1 string Byte size of entry ID 1 Bundle entry ID of 1st device + binary. This is not NUL + terminated. + + ... + + Entry Offset N integer 8 + + Entry Size N integer 8 + + Entry ID Length N integer 8 + + Entry ID N string Byte size of entry ID N + + 1st Device Binary bytes Size Of 1st Device Binary + + ... + + N-th Device Binary bytes Size Of N-th Device Binary + ========================= ======== ========================== ============================== + +The ``clang-offload-bundler`` is used to bundle device binaries for different processor +and feature settings. + +If target ID is supported, the rules of compatible offload targets in a single bundled device binary is defined +by the target (for AMD GPU see `AMDGPU Embedding Bundled Code Objects +`_). \ No newline at end of file Index: clang/docs/index.rst =================================================================== --- clang/docs/index.rst +++ clang/docs/index.rst @@ -78,6 +78,7 @@ ClangFormat ClangFormatStyleOptions ClangFormattedStatus + ClangOffloadBundlerFileFormat Design Documents ================ Index: llvm/docs/AMDGPUUsage.rst =================================================================== --- llvm/docs/AMDGPUUsage.rst +++ llvm/docs/AMDGPUUsage.rst @@ -90,191 +90,182 @@ Processors ---------- -Use the ``clang -mcpu `` option to specify the AMDGPU processor. The -names from both the *Processor* and *Alternative Processor* can be used. +Use the ``clang -mcpu `` or ``clang -mcpu `` option to +specify the AMDGPU processor. The names from both the *Processor* and +*Alternative Processor* can be used. Target ID is defined in :ref:`target-id` +which includes the *Processor* and the *Target Features*. .. table:: AMDGPU Processors :name: amdgpu-processor-table - =========== =============== ============ ===== ================= ======= ====================== - Processor Alternative Target dGPU/ Target ROCm Example - Processor Triple APU Features Support Products + =========== =============== ============ ===== ======================= ======= ====================== + Processor Alternative Target dGPU/ Target ROCm Example + Processor Triple APU Features Support Products Architecture Supported - [Default] - =========== =============== ============ ===== ================= ======= ====================== + =========== =============== ============ ===== ======================= ======= ====================== **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_ - ----------------------------------------------------------------------------------------------- + ----------------------------------------------------------------------------------------------------- ``r600`` ``r600`` dGPU ``r630`` ``r600`` dGPU ``rs880`` ``r600`` dGPU ``rv670`` ``r600`` dGPU **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_ - ----------------------------------------------------------------------------------------------- + ----------------------------------------------------------------------------------------------------- ``rv710`` ``r600`` dGPU ``rv730`` ``r600`` dGPU ``rv770`` ``r600`` dGPU **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_ - ----------------------------------------------------------------------------------------------- + ----------------------------------------------------------------------------------------------------- ``cedar`` ``r600`` dGPU ``cypress`` ``r600`` dGPU ``juniper`` ``r600`` dGPU ``redwood`` ``r600`` dGPU ``sumo`` ``r600`` dGPU **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_ - ----------------------------------------------------------------------------------------------- + ----------------------------------------------------------------------------------------------------- ``barts`` ``r600`` dGPU ``caicos`` ``r600`` dGPU ``cayman`` ``r600`` dGPU ``turks`` ``r600`` dGPU **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_ - ----------------------------------------------------------------------------------------------- + ----------------------------------------------------------------------------------------------------- ``gfx600`` - ``tahiti`` ``amdgcn`` dGPU ``gfx601`` - ``hainan`` ``amdgcn`` dGPU - ``oland`` - ``pitcairn`` - ``verde`` **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_ - ----------------------------------------------------------------------------------------------- - ``gfx700`` - ``kaveri`` ``amdgcn`` APU - A6-7000 - - A6 Pro-7050B - - A8-7100 - - A8 Pro-7150B - - A10-7300 - - A10 Pro-7350B - - FX-7500 - - A8-7200P - - A10-7400P - - FX-7600P - ``gfx701`` - ``hawaii`` ``amdgcn`` dGPU ROCm - FirePro W8100 - - FirePro W9100 - - FirePro S9150 - - FirePro S9170 - ``gfx702`` ``amdgcn`` dGPU ROCm - Radeon R9 290 - - Radeon R9 290x - - Radeon R390 - - Radeon R390x - ``gfx703`` - ``kabini`` ``amdgcn`` APU - E1-2100 - - ``mullins`` - E1-2200 - - E1-2500 - - E2-3000 - - E2-3800 - - A4-5000 - - A4-5100 - - A6-5200 - - A4 Pro-3340B - ``gfx704`` - ``bonaire`` ``amdgcn`` dGPU - Radeon HD 7790 - - Radeon HD 8770 - - R7 260 - - R7 260X + ----------------------------------------------------------------------------------------------------- + ``gfx700`` - ``kaveri`` ``amdgcn`` APU - A6-7000 + - A6 Pro-7050B + - A8-7100 + - A8 Pro-7150B + - A10-7300 + - A10 Pro-7350B + - FX-7500 + - A8-7200P + - A10-7400P + - FX-7600P + ``gfx701`` - ``hawaii`` ``amdgcn`` dGPU ROCm - FirePro W8100 + - FirePro W9100 + - FirePro S9150 + - FirePro S9170 + ``gfx702`` ``amdgcn`` dGPU ROCm - Radeon R9 290 + - Radeon R9 290x + - Radeon R390 + - Radeon R390x + ``gfx703`` - ``kabini`` ``amdgcn`` APU - E1-2100 + - ``mullins`` - E1-2200 + - E1-2500 + - E2-3000 + - E2-3800 + - A4-5000 + - A4-5100 + - A6-5200 + - A4 Pro-3340B + ``gfx704`` - ``bonaire`` ``amdgcn`` dGPU - Radeon HD 7790 + - Radeon HD 8770 + - R7 260 + - R7 260X **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_ - ----------------------------------------------------------------------------------------------- - ``gfx801`` - ``carrizo`` ``amdgcn`` APU - xnack - A6-8500P - [on] - Pro A6-8500B - - A8-8600P - - Pro A8-8600B - - FX-8800P - - Pro A12-8800B - \ ``amdgcn`` APU - xnack ROCm - A10-8700P - [on] - Pro A10-8700B - - A10-8780P - \ ``amdgcn`` APU - xnack - A10-9600P - [on] - A10-9630P - - A12-9700P - - A12-9730P - - FX-9800P - - FX-9830P - \ ``amdgcn`` APU - xnack - E2-9010 - [on] - A6-9210 - - A9-9410 - ``gfx802`` - ``iceland`` ``amdgcn`` dGPU - xnack ROCm - FirePro S7150 - - ``tonga`` [off] - FirePro S7100 - - FirePro W7100 - - Radeon R285 - - Radeon R9 380 - - Radeon R9 385 - - Mobile FirePro - M7170 - ``gfx803`` - ``fiji`` ``amdgcn`` dGPU - xnack ROCm - Radeon R9 Nano - [off] - Radeon R9 Fury - - Radeon R9 FuryX - - Radeon Pro Duo - - FirePro S9300x2 - - Radeon Instinct MI8 - \ - ``polaris10`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 470 - [off] - Radeon RX 480 - - Radeon Instinct MI6 - \ - ``polaris11`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 460 - [off] + ----------------------------------------------------------------------------------------------------- + ``gfx801`` - ``carrizo`` ``amdgcn`` APU - xnack - A6-8500P + - Pro A6-8500B + - A8-8600P + - Pro A8-8600B + - FX-8800P + - Pro A12-8800B + \ ``amdgcn`` APU - xnack ROCm - A10-8700P + - Pro A10-8700B + - A10-8780P + \ ``amdgcn`` APU - xnack - A10-9600P + - A10-9630P + - A12-9700P + - A12-9730P + - FX-9800P + - FX-9830P + \ ``amdgcn`` APU - xnack - E2-9010 + - A6-9210 + - A9-9410 + ``gfx802`` - ``iceland`` ``amdgcn`` dGPU - xnack ROCm - FirePro S7150 + - ``tonga`` - FirePro S7100 + - FirePro W7100 + - Radeon R285 + - Radeon R9 380 + - Radeon R9 385 + - Mobile FirePro + M7170 + ``gfx803`` - ``fiji`` ``amdgcn`` dGPU - xnack ROCm - Radeon R9 Nano + - Radeon R9 Fury + - Radeon R9 FuryX + - Radeon Pro Duo + - FirePro S9300x2 + - Radeon Instinct MI8 + \ - ``polaris10`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 470 + - Radeon RX 480 + - Radeon Instinct MI6 + \ - ``polaris11`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 460 + ``gfx810`` - ``stoney`` ``amdgcn`` APU - xnack - [on] + **GCN GFX9** [AMD-GCN-GFX9]_ - ----------------------------------------------------------------------------------------------- - ``gfx900`` ``amdgcn`` dGPU - xnack ROCm - Radeon Vega - [off] Frontier Edition - - Radeon RX Vega 56 - - Radeon RX Vega 64 - - Radeon RX Vega 64 - Liquid - - Radeon Instinct MI25 - ``gfx902`` ``amdgcn`` APU - xnack - Ryzen 3 2200G - [on] - Ryzen 5 2400G - ``gfx904`` ``amdgcn`` dGPU - xnack *TBA* - [off] - .. TODO:: - Add product - names. - ``gfx906`` ``amdgcn`` dGPU - xnack - Radeon Instinct MI50 - [off] - Radeon Instinct MI60 - - sram-ecc - Radeon VII - [off] - Radeon Pro VII - ``gfx908`` ``amdgcn`` dGPU - xnack *TBA* - [off] - - sram-ecc - [on] - .. TODO:: - Add product - names. - ``gfx909`` ``amdgcn`` APU - xnack *TBA* - [on] - .. TODO:: - Add product - names. + ----------------------------------------------------------------------------------------------------- + ``gfx900`` ``amdgcn`` dGPU - xnack ROCm - Radeon Vega + Frontier Edition + - Radeon RX Vega 56 + - Radeon RX Vega 64 + - Radeon RX Vega 64 + Liquid + - Radeon Instinct MI25 + ``gfx902`` ``amdgcn`` APU - xnack - Ryzen 3 2200G + - Ryzen 5 2400G + ``gfx904`` ``amdgcn`` dGPU - xnack *TBA* + + .. TODO:: + Add product + names. + ``gfx906`` ``amdgcn`` dGPU - sramecc - Radeon Instinct MI50 + - xnack - Radeon Instinct MI60 + - Radeon VII + - Radeon Pro VII + ``gfx908`` ``amdgcn`` dGPU - sramecc *TBA* + - xnack + .. TODO:: + Add product + names. + ``gfx909`` ``amdgcn`` APU - xnack *TBA* + .. TODO:: + Add product + names. **GCN GFX10** [AMD-GCN-GFX10]_ - ----------------------------------------------------------------------------------------------- - ``gfx1010`` ``amdgcn`` dGPU - xnack - Radeon RX 5700 - [off] - Radeon RX 5700 XT - - wavefrontsize64 - Radeon Pro 5600 XT - [off] - - cumode - [off] - ``gfx1011`` ``amdgcn`` dGPU - xnack - Radeon Pro 5600M - [off] + ----------------------------------------------------------------------------------------------------- + ``gfx1010`` ``amdgcn`` dGPU - cumode - Radeon RX 5700 + - wavefrontsize64 - Radeon RX 5700 XT + - xnack - Radeon Pro 5600 XT + + ``gfx1011`` ``amdgcn`` dGPU - cumode - Radeon Pro 5600M - wavefrontsize64 - [off] - - cumode - [off] - ``gfx1012`` ``amdgcn`` dGPU - xnack - Radeon RX 5500 - [off] - Radeon RX 5500 XT + - xnack + + ``gfx1012`` ``amdgcn`` dGPU - cumode - Radeon RX 5500 + - wavefrontsize64 - Raedon RX 5500 XT + - xnack + + ``gfx1030`` ``amdgcn`` dGPU - cumode *TBA* - wavefrontsize64 - [off] - - cumode - [off] - ``gfx1030`` ``amdgcn`` dGPU - wavefrontsize64 *TBA* - [off] - - cumode - [off] - .. TODO - Add product - names. - ``gfx1031`` ``amdgcn`` dGPU - wavefrontsize64 *TBA* - [off] - - cumode - [off] - .. TODO - Add product - names. + .. TODO:: + Add product + names. + + ``gfx1031`` ``amdgcn`` dGPU - cumode *TBA* + - wavefrontsize64 + .. TODO:: + Add product + names. =========== =============== ============ ===== ================= ======= ====================== +See :ref:`target-features` and :ref:`target-id` for more information on target features. + .. _amdgpu-target-features: Target Features @@ -287,55 +278,143 @@ generating the code. A mismatch of features may result in incorrect execution, or a reduction in performance. -The target features supported by each processor, and the default value -used if not specified explicitly, is listed in +The target features supported by each processor is listed in :ref:`amdgpu-processor-table`. -Use the ``clang -m[no-]`` option to specify the AMDGPU -target features. +Target features are controlled by exactly one of the following ``clang`` +options: + +``-mcpu=`` + + The ``-mcpu`` can specify the target feature as optional components + of the target ID. See :ref:`_amdgpu-target-ids`. If omitted, the target + feature has the ``any`` value. + +``-m[no-]`` + + Target features not specified by the target ID are specified using a + separate option. These target features can have an ``on`` or ``off`` + value. ``on`` is specified by omitting the ``no-`` prefix, and + ``off`` is specified by including the ``no-`` prefix. The default + if not specified is ``off``. For example: -``-mxnack`` +``-mcpu gfx908:xnack+`` Enable the ``xnack`` feature. -``-mno-xnack`` +``-mcpu gfx908:xnack-`` Disable the ``xnack`` feature. +``-mcumode`` + Enable the ``cumode`` feature. +``-mno-cumode`` + Disable the ``cumode`` feature. .. table:: AMDGPU Target Features :name: amdgpu-target-feature-table - ====================== ================================================== - Target Feature Description - ====================== ================================================== - -m[no-]xnack Enable/disable generating code that has - memory clauses that are compatible with - having XNACK replay enabled. - - This is used for demand paging and page - migration. If XNACK replay is enabled in - the device, then if a page fault occurs - the code may execute incorrectly if the - ``xnack`` feature is not enabled. Executing - code that has the feature enabled on a - device that does not have XNACK replay - enabled will execute correctly but may - be less performant than code with the - feature disabled. - - -m[no-]sram-ecc Enable/disable generating code that assumes SRAM - ECC is enabled/disabled. - - -m[no-]wavefrontsize64 Control the default wavefront size used when - generating code for kernels. When disabled - native wavefront size 32 is used, when enabled - wavefront size 64 is used. - - -m[no-]cumode Control the default wavefront execution mode used - when generating code for kernels. When disabled - native WGP wavefront execution mode is used, - when enabled CU wavefront execution mode is used - (see :ref:`amdgpu-amdhsa-memory-model`). - ====================== ================================================== + ====================== ======================= ================================================== + Target Feature ``Clang`` Option to Description + Name Control + ====================== ======================= ================================================== + cumode -m[no-]cumode Control the wavefront execution mode used + when generating code for kernels. When disabled + native WGP wavefront execution mode is used, + when enabled CU wavefront execution mode is used + (see :ref:`amdgpu-amdhsa-memory-model`). + + sramecc -mcpu If specified, generate code that can only be + loaded and executed in a process that has a + matching setting for SRAM ECC. + + If not specified, generate code that can be + loaded and executed in a process with either + setting of SRAM ECC. + + wavefrontsize64 -m[no-]wavefrontsize64 Control the wavefront size used when + generating code for kernels. When disabled + native wavefront size 32 is used, when enabled + wavefront size 64 is used. + + xnack -mcpu If specified, generate code that can only be + loaded and executed in a process that has a + matching setting for XNACK replay. + + If not specified, generate code that can be + loaded and executed in a process with either + setting of XNACK replay. + + This is used for demand paging and page + migration. If XNACK replay is enabled in + the device, then if a page fault occurs + the code may execute incorrectly if the + ``xnack`` feature is not enabled. Executing + code that has the feature enabled on a + device that does not have XNACK replay + enabled will execute correctly but may + be less performant than code with the + feature disabled. + ====================== ======================= ================================================== + +.. _amdgpu-target-ids: + +Target IDs +---------- +A target ID is used to indicate the processor configuration a device binary is +compiled for. It can be treated as an extension of processor since the validity of a +device binary depends not only on the processor but also its configuration +which is represented by a set of target features. Target ID provides a way to +represent processor configurations which affect ISA generation. + +Target ID syntax is defined by the following EBNF syntax: + +.. code:: + + ::= ( ":" ( "+" | "-" ) )* + +Where: + +**processor** + Is a AMDGPU processor or alternative processor name specified + in :ref:`amdgpu-processor-table`. + +**target_feature** + Is a target feature name specified in :ref:`target-features-table` that is + supported by the processor. The target features supported by each processor + is specified in :ref:`amdgpu-processor-table`. Each target feature must + appear at most once in a target ID and can have one of three values: + + *Any* + Specified by omitting the target feature from the target ID. + A code object compiled with a target ID specifying the default + value of a target feature can be loaded and executed on a processor + configured with the target feature on or off. + + *On* + Specified by ``+``, indicating the target feature is enabled. A code + object compiled with a target ID specifying a target feature on + can only be loaded on a processor configured with the target feature on. + + *Off* + specified by ``-``, indicating the target feature is disabled. A code + object compiled with a target ID specifying a target feature off + can only be loaded on a processor configured with the target feature off. + +There are two forms of target ID: + +*Non-Canonical Form* + The non-canonical form is used as the input to user commands to allow + the user greater convenience. It allows both the primary and alternative + processor name to be used (see :ref:`amdgpu-processors`) and the target + features may be specified in any order (see :ref:`amdgpu-target-features`). + +*Canonical Form* + The canonical form is used for all generated output to allow greater + convenience for tools that consume the information. It is also used for + internal passing of information between tools. Only the primary and not + alternative processor name is used (see :ref:`amdgpu-processors`) and + the target features are specified in alphabetic order + (see :ref:`amdgpu-target-features`). Command line tools convert + non-canonical form to canonical form. .. _amdgpu-address-spaces: @@ -647,7 +726,8 @@ - ``ET_DYN`` ``e_machine`` ``EM_AMDGPU`` ``e_entry`` 0 - ``e_flags`` See :ref:`amdgpu-elf-header-e_flags-table` + ``e_flags`` See :ref:`amdgpu-elf-header-e_flags-table-v0_v1` + and :ref:`amdgpu-elf-header-e_flags-table-v2` ========================== =============================== .. @@ -655,18 +735,20 @@ .. table:: AMDGPU ELF Header Enumeration Values :name: amdgpu-elf-header-enumeration-values-table - =============================== ===== + =============================== ====== Name Value - =============================== ===== + =============================== ====== ``EM_AMDGPU`` 224 ``ELFOSABI_NONE`` 0 ``ELFOSABI_AMDGPU_HSA`` 64 ``ELFOSABI_AMDGPU_PAL`` 65 ``ELFOSABI_AMDGPU_MESA3D`` 66 - ``ELFABIVERSION_AMDGPU_HSA`` 1 + ``ELFABIVERSION_AMDGPU_HSA_V0`` 0 + ``ELFABIVERSION_AMDGPU_HSA_V1`` 1 + ``ELFABIVERSION_AMDGPU_HSA_V2`` 2 ``ELFABIVERSION_AMDGPU_PAL`` 0 ``ELFABIVERSION_AMDGPU_MESA3D`` 0 - =============================== ===== + =============================== ====== ``e_ident[EI_CLASS]`` The ELF class is: @@ -722,7 +804,8 @@ by the ``r600`` and ``amdgcn`` architectures (see :ref:`amdgpu-processor-table`). The specific processor is specified in the ``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see - :ref:`amdgpu-elf-header-e_flags-table`). + :ref:`amdgpu-elf-header-e_flags-table-v0_v1` and + :ref:`amdgpu-elf-header-e_flags-table-v2`). ``e_entry`` The entry point is 0 as the entry points for individual kernels must be @@ -731,8 +814,8 @@ ``e_flags`` The AMDGPU backend uses the following ELF header flags: - .. table:: AMDGPU ELF Header ``e_flags`` - :name: amdgpu-elf-header-e_flags-table + .. table:: AMDGPU ELF Header ``e_flags`` (``EI_ABIVERSION_V0`` and ``EI_ABIVERSION_V1``) + :name: amdgpu-elf-header-e_flags-table-v0_v1 ================================= ========== ============================= Name Value Description @@ -744,7 +827,7 @@ ``EF_AMDGPU_MACH_xxx`` values defined in :ref:`amdgpu-ef-amdgpu-mach-table`. - ``EF_AMDGPU_XNACK`` 0x00000100 Indicates if the ``xnack`` + ``EF_AMDGPU_FEATURE_XNACK_V3`` 0x00000100 Indicates if the ``xnack`` target feature is enabled for all code contained in the code object. @@ -755,19 +838,42 @@ be 0. See :ref:`amdgpu-target-features`. - ``EF_AMDGPU_SRAM_ECC`` 0x00000200 Indicates if the ``sram-ecc`` + ``EF_AMDGPU_FEATURE_SRAMECC_V3`` 0x00000200 Indicates if the ``sramecc`` target feature is enabled for all code contained in the code object. If the processor does not support the - ``sram-ecc`` target + ``sramecc`` target feature then must be 0. See :ref:`amdgpu-target-features`. ================================= ========== ============================= + .. table:: AMDGPU ELF Header ``e_flags`` (``EI_ABIVERSION_V2``) + :name: amdgpu-elf-header-e_flags-table-v2 + + ================================= ========== ========================================== + Name Value Description + ================================= ========== ========================================== + **AMDGPU Processor Flag** See :ref:`amdgpu-processor-table`. + -------------------------------------------- ------------------------------------------ + ``EF_AMDGPU_MACH`` 0x000000ff AMDGPU processor selection + mask for + ``EF_AMDGPU_MACH_xxx`` values + defined in + :ref:`amdgpu-ef-amdgpu-mach-table`. + ``EF_AMDGPU_FEATURE_XNACK_V4`` 0x00000300 XNACK selection mask for + ``EF_AMDGPU_FEATURE_XNACK_xxx`` values + defined in + :ref:`amdgpu-ef-amdgpu-feature-xnack-table`. + ``EF_AMDGPU_FEATURE_SRAMECC_V4`` 0x00000c00 SRAMECC selection mask for + ``EF_AMDGPU_FEATURE_SRAMECC_xxx`` values + defined in + :ref:`amdgpu-ef-amdgpu-feature-sramecc-table`. + ================================= ========== ========================================== + .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values :name: amdgpu-ef-amdgpu-mach-table @@ -820,6 +926,30 @@ ``EF_AMDGPU_MACH_AMDGCN_GFX1031`` 0x037 ``gfx1031`` ================================= ========== ============================= + .. table:: AMDGPU ``EF_AMDGPU_FEATURE_XNACK`` Values (see :ref:`amdgpu-target-features`) + :name: amdgpu-ef-amdgpu-feature-xnack-table + + ============================================= ===== + Name Value + ============================================= ===== + ``EF_AMDGPU_FEATURE_XNACK_UNSUPPORTED_V4`` 0x0 + ``EF_AMDGPU_FEATURE_XNACK_ANY_V4`` 0x1 + ``EF_AMDGPU_FEATURE_XNACK_OFF_V4`` 0x2 + ``EF_AMDGPU_FEATURE_XNACK_ON_V4`` 0x3 + ============================================= ===== + + .. table:: AMDGPU ``EF_AMDGPU_FEATURE_SRAMECC`` Values (see :ref:`amdgpu-target-features`) + :name: amdgpu-ef-amdgpu-feature-sramecc-table + + =============================================== ===== + Name Value + =============================================== ===== + ``EF_AMDGPU_FEATURE_SRAMECC_UNSUPPORTED_V4`` 0x0 + ``EF_AMDGPU_FEATURE_SRAMECC_ANY_V4`` 0x1 + ``EF_AMDGPU_FEATURE_SRAMECC_OFF_V4`` 0x2 + ``EF_AMDGPU_FEATURE_SRAMECC_ON_V4`` 0x3 + =============================================== ===== + Sections -------- @@ -1161,6 +1291,26 @@ file:///dir3/dir4/file2#offset=0x2000&size=3000 memory://1234#offset=0x20000&size=3000 +.. _amdgpu-embedding-bundled-objects: + +Embedding Bundled Code Objects +============================== + +Use one or more ``--offload-arch=`` clang options to specify the +target IDs of the offload code regions of a single source programing language. + +The compiler will perform a separate compilation for the host and a separate +compilation for the offload code regions for each specified target ID. The +``clang-offload-bundler`` is used to bundle the offload code objects +(see `ClangOffloadBundlerFileFormat`_). +The bundled code object is embedded in the host code object as a data section +with the name ``.hip_fatbin``. + +The host compilation includes an ``init`` function that will use the runtime +corresponding to the offload kind (see :ref:`amdgpu-offload-kind-table`) to +load the offload code objects appropriate to the devices present when the +host program is executed. + .. _amdgpu-dwarf-debug-information: DWARF Debug Information @@ -2506,45 +2656,46 @@ .. table:: AMDHSA Code Object V3 Metadata Map :name: amdgpu-amdhsa-code-object-metadata-map-table-v3 - ================= ============== ========= ======================================= - String Key Value Type Required? Description - ================= ============== ========= ======================================= - "amdhsa.version" sequence of Required - The first integer is the major - 2 integers version. Currently 1. - - The second integer is the minor - version. Currently 0. - "amdhsa.printf" sequence of Each string is encoded information - strings about a printf function call. The - encoded information is organized as - fields separated by colon (':'): - - ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString`` - - where: - - ``ID`` - A 32-bit integer as a unique id for - each printf function call - - ``N`` - A 32-bit integer equal to the number - of arguments of printf function call - minus 1 - - ``S[i]`` (where i = 0, 1, ... , N-1) - 32-bit integers for the size in bytes - of the i-th FormatString argument of - the printf function call - - FormatString - The format string passed to the - printf function call. - "amdhsa.kernels" sequence of Required Sequence of the maps for each - map kernel in the code object. See + ================== ============== ========= ======================================= + String Key Value Type Required? Description + ================== ============== ========= ======================================= + "amdhsa.version" sequence of Required - The first integer is the major + 2 integers version. Currently 1. + - The second integer is the minor + version. Currently 0. + "amdhsa.printf" sequence of Each string is encoded information + strings about a printf function call. The + encoded information is organized as + fields separated by colon (':'): + + ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString`` + + where: + + ``ID`` + A 32-bit integer as a unique id for + each printf function call + + ``N`` + A 32-bit integer equal to the number + of arguments of printf function call + minus 1 + + ``S[i]`` (where i = 0, 1, ... , N-1) + 32-bit integers for the size in bytes + of the i-th FormatString argument of + the printf function call + + FormatString + The format string passed to the + printf function call. + "amdhsa.kernels" sequence of Required Sequence of the maps for each + map kernel in the code object. See :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3` for the definition of the keys included in that map. - ================= ============== ========= ======================================= + "amdhsa.target_id" string Required "-" + ================== ============== ========= ======================================= .. @@ -7523,14 +7674,13 @@ ``amdhsa`` OS is specified. See :ref:`amdgpu-target-triples` and :ref:`amdgpu-processors`. -.amdgcn_target -+++++++++++++++++++++++ +.amdgcn_target "-" +++++++++++++++++++++++++++++++++++++++++++++++ -Optional directive which declares the target supported by the containing +Optional directive which declares the "-" supported by the containing assembler source file. Valid values are described in :ref:`amdgpu-amdhsa-code-object-target-identification`. Used by the assembler -to validate command-line options such as ``-triple``, ``-mcpu``, and those -which specify target features. +to validate command-line options such as ``-triple`` and ``-mcpu``. .amdhsa_kernel +++++++++++++++++++++ @@ -7606,10 +7756,6 @@ scratch memory. Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. - ``.amdhsa_reserve_xnack_mask`` Target GFX8-GFX10 Whether the kernel may trigger XNACK replay. - Feature Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in - Specific :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. - (+xnack) ``.amdhsa_float_round_mode_32`` 0 GFX6-GFX10 Controls FLOAT_ROUND_MODE_32 in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. Possible values are defined in Index: llvm/docs/AMDGPUUsage_V3.rst =================================================================== --- /dev/null +++ llvm/docs/AMDGPUUsage_V3.rst @@ -0,0 +1,7823 @@ +============================= +User Guide for AMDGPU Backend +============================= + +This version is for code object v3. For the latest verson please see :doc:`AMDGPUUsage`. + +.. contents:: + :local: + +.. toctree:: + :hidden: + + AMDGPU/AMDGPUAsmGFX7 + AMDGPU/AMDGPUAsmGFX8 + AMDGPU/AMDGPUAsmGFX9 + AMDGPU/AMDGPUAsmGFX900 + AMDGPU/AMDGPUAsmGFX904 + AMDGPU/AMDGPUAsmGFX906 + AMDGPU/AMDGPUAsmGFX908 + AMDGPU/AMDGPUAsmGFX10 + AMDGPU/AMDGPUAsmGFX1011 + AMDGPUModifierSyntax + AMDGPUOperandSyntax + AMDGPUInstructionSyntax + AMDGPUInstructionNotation + AMDGPUDwarfExtensionsForHeterogeneousDebugging + +Introduction +============ + +The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the +R600 family up until the current GCN families. It lives in the +``llvm/lib/Target/AMDGPU`` directory. + +LLVM +==== + +.. _amdgpu-target-triples: + +Target Triples +-------------- + +Use the ``clang -target ---`` option to +specify the target triple: + + .. table:: AMDGPU Architectures + :name: amdgpu-architecture-table + + ============ ============================================================== + Architecture Description + ============ ============================================================== + ``r600`` AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders. + ``amdgcn`` AMD GPUs GCN GFX6 onwards for graphics and compute shaders. + ============ ============================================================== + + .. table:: AMDGPU Vendors + :name: amdgpu-vendor-table + + ============ ============================================================== + Vendor Description + ============ ============================================================== + ``amd`` Can be used for all AMD GPU usage. + ``mesa3d`` Can be used if the OS is ``mesa3d``. + ============ ============================================================== + + .. table:: AMDGPU Operating Systems + :name: amdgpu-os-table + + ============== ============================================================ + OS Description + ============== ============================================================ + ** Defaults to the *unknown* OS. + ``amdhsa`` Compute kernels executed on HSA [HSA]_ compatible runtimes + such as AMD's ROCm [AMD-ROCm]_. + ``amdpal`` Graphic shaders and compute kernels executed on AMD PAL + runtime. + ``mesa3d`` Graphic shaders and compute kernels executed on Mesa 3D + runtime. + ============== ============================================================ + + .. table:: AMDGPU Environments + :name: amdgpu-environment-table + + ============ ============================================================== + Environment Description + ============ ============================================================== + ** Default. + ============ ============================================================== + +.. _amdgpu-processors: + +Processors +---------- + +Use the ``clang -mcpu `` option to specify the AMDGPU processor. The +names from both the *Processor* and *Alternative Processor* can be used. + + .. table:: AMDGPU Processors + :name: amdgpu-processor-table + + =========== =============== ============ ===== ================= ======= ====================== + Processor Alternative Target dGPU/ Target ROCm Example + Processor Triple APU Features Support Products + Architecture Supported + [Default] + =========== =============== ============ ===== ================= ======= ====================== + **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_ + ----------------------------------------------------------------------------------------------- + ``r600`` ``r600`` dGPU + ``r630`` ``r600`` dGPU + ``rs880`` ``r600`` dGPU + ``rv670`` ``r600`` dGPU + **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_ + ----------------------------------------------------------------------------------------------- + ``rv710`` ``r600`` dGPU + ``rv730`` ``r600`` dGPU + ``rv770`` ``r600`` dGPU + **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_ + ----------------------------------------------------------------------------------------------- + ``cedar`` ``r600`` dGPU + ``cypress`` ``r600`` dGPU + ``juniper`` ``r600`` dGPU + ``redwood`` ``r600`` dGPU + ``sumo`` ``r600`` dGPU + **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_ + ----------------------------------------------------------------------------------------------- + ``barts`` ``r600`` dGPU + ``caicos`` ``r600`` dGPU + ``cayman`` ``r600`` dGPU + ``turks`` ``r600`` dGPU + **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_ + ----------------------------------------------------------------------------------------------- + ``gfx600`` - ``tahiti`` ``amdgcn`` dGPU + ``gfx601`` - ``hainan`` ``amdgcn`` dGPU + - ``oland`` + - ``pitcairn`` + - ``verde`` + **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_ + ----------------------------------------------------------------------------------------------- + ``gfx700`` - ``kaveri`` ``amdgcn`` APU - A6-7000 + - A6 Pro-7050B + - A8-7100 + - A8 Pro-7150B + - A10-7300 + - A10 Pro-7350B + - FX-7500 + - A8-7200P + - A10-7400P + - FX-7600P + ``gfx701`` - ``hawaii`` ``amdgcn`` dGPU ROCm - FirePro W8100 + - FirePro W9100 + - FirePro S9150 + - FirePro S9170 + ``gfx702`` ``amdgcn`` dGPU ROCm - Radeon R9 290 + - Radeon R9 290x + - Radeon R390 + - Radeon R390x + ``gfx703`` - ``kabini`` ``amdgcn`` APU - E1-2100 + - ``mullins`` - E1-2200 + - E1-2500 + - E2-3000 + - E2-3800 + - A4-5000 + - A4-5100 + - A6-5200 + - A4 Pro-3340B + ``gfx704`` - ``bonaire`` ``amdgcn`` dGPU - Radeon HD 7790 + - Radeon HD 8770 + - R7 260 + - R7 260X + **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_ + ----------------------------------------------------------------------------------------------- + ``gfx801`` - ``carrizo`` ``amdgcn`` APU - xnack - A6-8500P + [on] - Pro A6-8500B + - A8-8600P + - Pro A8-8600B + - FX-8800P + - Pro A12-8800B + \ ``amdgcn`` APU - xnack ROCm - A10-8700P + [on] - Pro A10-8700B + - A10-8780P + \ ``amdgcn`` APU - xnack - A10-9600P + [on] - A10-9630P + - A12-9700P + - A12-9730P + - FX-9800P + - FX-9830P + \ ``amdgcn`` APU - xnack - E2-9010 + [on] - A6-9210 + - A9-9410 + ``gfx802`` - ``iceland`` ``amdgcn`` dGPU - xnack ROCm - FirePro S7150 + - ``tonga`` [off] - FirePro S7100 + - FirePro W7100 + - Radeon R285 + - Radeon R9 380 + - Radeon R9 385 + - Mobile FirePro + M7170 + ``gfx803`` - ``fiji`` ``amdgcn`` dGPU - xnack ROCm - Radeon R9 Nano + [off] - Radeon R9 Fury + - Radeon R9 FuryX + - Radeon Pro Duo + - FirePro S9300x2 + - Radeon Instinct MI8 + \ - ``polaris10`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 470 + [off] - Radeon RX 480 + - Radeon Instinct MI6 + \ - ``polaris11`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 460 + [off] + ``gfx810`` - ``stoney`` ``amdgcn`` APU - xnack + [on] + **GCN GFX9** [AMD-GCN-GFX9]_ + ----------------------------------------------------------------------------------------------- + ``gfx900`` ``amdgcn`` dGPU - xnack ROCm - Radeon Vega + [off] Frontier Edition + - Radeon RX Vega 56 + - Radeon RX Vega 64 + - Radeon RX Vega 64 + Liquid + - Radeon Instinct MI25 + ``gfx902`` ``amdgcn`` APU - xnack - Ryzen 3 2200G + [on] - Ryzen 5 2400G + ``gfx904`` ``amdgcn`` dGPU - xnack *TBA* + [off] + .. TODO:: + Add product + names. + ``gfx906`` ``amdgcn`` dGPU - xnack - Radeon Instinct MI50 + [off] - Radeon Instinct MI60 + - sram-ecc - Radeon VII + [on] - Radeon Pro VII + ``gfx908`` ``amdgcn`` dGPU - xnack *TBA* + [off] + - sram-ecc + [on] + .. TODO:: + Add product + names. + ``gfx909`` ``amdgcn`` APU - xnack *TBA* + [on] + .. TODO:: + Add product + names. + **GCN GFX10** [AMD-GCN-GFX10]_ + ----------------------------------------------------------------------------------------------- + ``gfx1010`` ``amdgcn`` dGPU - xnack - Radeon RX 5700 + [off] - Radeon RX 5700 XT + - wavefrontsize64 - Radeon Pro 5600 XT + [off] + - cumode + [off] + ``gfx1011`` ``amdgcn`` dGPU - xnack - Radeon Pro 5600M + [off] + - wavefrontsize64 + [off] + - cumode + [off] + ``gfx1012`` ``amdgcn`` dGPU - xnack - Radeon RX 5500 + [off] - Radeon RX 5500 XT + - wavefrontsize64 + [off] + - cumode + [off] + ``gfx1030`` ``amdgcn`` dGPU - wavefrontsize64 *TBA* + [off] + - cumode + [off] + .. TODO + Add product + names. + ``gfx1031`` ``amdgcn`` dGPU - wavefrontsize64 *TBA* + [off] + - cumode + [off] + .. TODO + Add product + names. + =========== =============== ============ ===== ================= ======= ====================== + +.. _amdgpu-target-features: + +Target Features +--------------- + +Target features control how code is generated to support certain +processor specific features. Not all target features are supported by +all processors. The runtime must ensure that the features supported by +the device used to execute the code match the features enabled when +generating the code. A mismatch of features may result in incorrect +execution, or a reduction in performance. + +The target features supported by each processor, and the default value +used if not specified explicitly, is listed in +:ref:`amdgpu-processor-table`. + +Use the ``clang -m[no-]`` option to specify the AMDGPU +target features. + +For example: + +``-mxnack`` + Enable the ``xnack`` feature. +``-mno-xnack`` + Disable the ``xnack`` feature. + + .. table:: AMDGPU Target Features + :name: amdgpu-target-feature-table + + ====================== ================================================== + Target Feature Description + ====================== ================================================== + -m[no-]xnack Enable/disable generating code that has + memory clauses that are compatible with + having XNACK replay enabled. + + This is used for demand paging and page + migration. If XNACK replay is enabled in + the device, then if a page fault occurs + the code may execute incorrectly if the + ``xnack`` feature is not enabled. Executing + code that has the feature enabled on a + device that does not have XNACK replay + enabled will execute correctly but may + be less performant than code with the + feature disabled. + + -m[no-]sram-ecc Enable/disable generating code that assumes SRAM + ECC is enabled/disabled. + + -m[no-]wavefrontsize64 Control the default wavefront size used when + generating code for kernels. When disabled + native wavefront size 32 is used, when enabled + wavefront size 64 is used. + + -m[no-]cumode Control the default wavefront execution mode used + when generating code for kernels. When disabled + native WGP wavefront execution mode is used, + when enabled CU wavefront execution mode is used + (see :ref:`amdgpu-amdhsa-memory-model`). + ====================== ================================================== + +.. _amdgpu-address-spaces: + +Address Spaces +-------------- + +The AMDGPU architecture supports a number of memory address spaces. The address +space names use the OpenCL standard names, with some additions. + +The AMDGPU address spaces correspond to target architecture specific LLVM +address space numbers used in LLVM IR. + +The AMDGPU address spaces are described in +:ref:`amdgpu-address-spaces-table`. Only 64-bit process address spaces are +supported for the ``amdgcn`` target. + + .. table:: AMDGPU Address Spaces + :name: amdgpu-address-spaces-table + + ================================= =============== =========== ================ ======= ============================ + .. 64-Bit Process Address Space + --------------------------------- --------------- ----------- ---------------- ------------------------------------ + Address Space Name LLVM IR Address HSA Segment Hardware Address NULL Value + Space Number Name Name Size + ================================= =============== =========== ================ ======= ============================ + Generic 0 flat flat 64 0x0000000000000000 + Global 1 global global 64 0x0000000000000000 + Region 2 N/A GDS 32 *not implemented for AMDHSA* + Local 3 group LDS 32 0xFFFFFFFF + Constant 4 constant *same as global* 64 0x0000000000000000 + Private 5 private scratch 32 0xFFFFFFFF + Constant 32-bit 6 *TODO* 0x00000000 + Buffer Fat Pointer (experimental) 7 *TODO* + ================================= =============== =========== ================ ======= ============================ + +**Generic** + The generic address space uses the hardware flat address support available in + GFX7-GFX10. This uses two fixed ranges of virtual addresses (the private and + local apertures), that are outside the range of addressable global memory, to + map from a flat address to a private or local address. + + FLAT instructions can take a flat address and access global, private + (scratch), and group (LDS) memory depending on if the address is within one + of the aperture ranges. Flat access to scratch requires hardware aperture + setup and setup in the kernel prologue (see + :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). Flat access to LDS requires + hardware aperture setup and M0 (GFX7-GFX8) register setup (see + :ref:`amdgpu-amdhsa-kernel-prolog-m0`). + + To convert between a private or group address space address (termed a segment + address) and a flat address the base address of the corresponding aperture + can be used. For GFX7-GFX8 these are available in the + :ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with + Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For + GFX9-GFX10 the aperture base addresses are directly available as inline + constant registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. + In 64-bit address mode the aperture sizes are 2^32 bytes and the base is + aligned to 2^32 which makes it easier to convert from flat to segment or + segment to flat. + + A global address space address has the same value when used as a flat address + so no conversion is needed. + +**Global and Constant** + The global and constant address spaces both use global virtual addresses, + which are the same virtual address space used by the CPU. However, some + virtual addresses may only be accessible to the CPU, some only accessible + by the GPU, and some by both. + + Using the constant address space indicates that the data will not change + during the execution of the kernel. This allows scalar read instructions to + be used. The vector and scalar L1 caches are invalidated of volatile data + before each kernel dispatch execution to allow constant memory to change + values between kernel dispatches. + +**Region** + The region address space uses the hardware Global Data Store (GDS). All + wavefronts executing on the same device will access the same memory for any + given region address. However, the same region address accessed by wavefronts + executing on different devices will access different memory. It is higher + performance than global memory. It is allocated by the runtime. The data + store (DS) instructions can be used to access it. + +**Local** + The local address space uses the hardware Local Data Store (LDS) which is + automatically allocated when the hardware creates the wavefronts of a + work-group, and freed when all the wavefronts of a work-group have + terminated. All wavefronts belonging to the same work-group will access the + same memory for any given local address. However, the same local address + accessed by wavefronts belonging to different work-groups will access + different memory. It is higher performance than global memory. The data store + (DS) instructions can be used to access it. + +**Private** + The private address space uses the hardware scratch memory support which + automatically allocates memory when it creates a wavefront and frees it when + a wavefronts terminates. The memory accessed by a lane of a wavefront for any + given private address will be different to the memory accessed by another lane + of the same or different wavefront for the same private address. + + If a kernel dispatch uses scratch, then the hardware allocates memory from a + pool of backing memory allocated by the runtime for each wavefront. The lanes + of the wavefront access this using dword (4 byte) interleaving. The mapping + used from private address to backing memory address is: + + ``wavefront-scratch-base + + ((private-address / 4) * wavefront-size * 4) + + (wavefront-lane-id * 4) + (private-address % 4)`` + + If each lane of a wavefront accesses the same private address, the + interleaving results in adjacent dwords being accessed and hence requires + fewer cache lines to be fetched. + + There are different ways that the wavefront scratch base address is + determined by a wavefront (see + :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). + + Scratch memory can be accessed in an interleaved manner using buffer + instructions with the scratch buffer descriptor and per wavefront scratch + offset, by the scratch instructions, or by flat instructions. Multi-dword + access is not supported except by flat and scratch instructions in + GFX9-GFX10. + +**Constant 32-bit** + *TODO* + +**Buffer Fat Pointer** + The buffer fat pointer is an experimental address space that is currently + unsupported in the backend. It exposes a non-integral pointer that is in + the future intended to support the modelling of 128-bit buffer descriptors + plus a 32-bit offset into the buffer (in total encapsulating a 160-bit + *pointer*), allowing normal LLVM load/store/atomic operations to be used to + model the buffer descriptors used heavily in graphics workloads targeting + the backend. + +.. _amdgpu-memory-scopes: + +Memory Scopes +------------- + +This section provides LLVM memory synchronization scopes supported by the AMDGPU +backend memory model when the target triple OS is ``amdhsa`` (see +:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`). + +The memory model supported is based on the HSA memory model [HSA]_ which is +based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before +relation is transitive over the synchronizes-with relation independent of scope +and synchronizes-with allows the memory scope instances to be inclusive (see +table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`). + +This is different to the OpenCL [OpenCL]_ memory model which does not have scope +inclusion and requires the memory scopes to exactly match. However, this +is conservatively correct for OpenCL. + + .. table:: AMDHSA LLVM Sync Scopes + :name: amdgpu-amdhsa-llvm-sync-scopes-table + + ======================= =================================================== + LLVM Sync Scope Description + ======================= =================================================== + *none* The default: ``system``. + + Synchronizes with, and participates in modification + and seq_cst total orderings with, other operations + (except image operations) for all address spaces + (except private, or generic that accesses private) + provided the other operation's sync scope is: + + - ``system``. + - ``agent`` and executed by a thread on the same + agent. + - ``workgroup`` and executed by a thread in the + same work-group. + - ``wavefront`` and executed by a thread in the + same wavefront. + + ``agent`` Synchronizes with, and participates in modification + and seq_cst total orderings with, other operations + (except image operations) for all address spaces + (except private, or generic that accesses private) + provided the other operation's sync scope is: + + - ``system`` or ``agent`` and executed by a thread + on the same agent. + - ``workgroup`` and executed by a thread in the + same work-group. + - ``wavefront`` and executed by a thread in the + same wavefront. + + ``workgroup`` Synchronizes with, and participates in modification + and seq_cst total orderings with, other operations + (except image operations) for all address spaces + (except private, or generic that accesses private) + provided the other operation's sync scope is: + + - ``system``, ``agent`` or ``workgroup`` and + executed by a thread in the same work-group. + - ``wavefront`` and executed by a thread in the + same wavefront. + + ``wavefront`` Synchronizes with, and participates in modification + and seq_cst total orderings with, other operations + (except image operations) for all address spaces + (except private, or generic that accesses private) + provided the other operation's sync scope is: + + - ``system``, ``agent``, ``workgroup`` or + ``wavefront`` and executed by a thread in the + same wavefront. + + ``singlethread`` Only synchronizes with and participates in + modification and seq_cst total orderings with, + other operations (except image operations) running + in the same thread for all address spaces (for + example, in signal handlers). + + ``one-as`` Same as ``system`` but only synchronizes with other + operations within the same address space. + + ``agent-one-as`` Same as ``agent`` but only synchronizes with other + operations within the same address space. + + ``workgroup-one-as`` Same as ``workgroup`` but only synchronizes with + other operations within the same address space. + + ``wavefront-one-as`` Same as ``wavefront`` but only synchronizes with + other operations within the same address space. + + ``singlethread-one-as`` Same as ``singlethread`` but only synchronizes with + other operations within the same address space. + ======================= =================================================== + +LLVM IR Intrinsics +------------------ + +The AMDGPU backend implements the following LLVM IR intrinsics. + +*This section is WIP.* + +.. TODO:: + + List AMDGPU intrinsics. + +LLVM IR Attributes +------------------ + +The AMDGPU backend supports the following LLVM IR attributes. + + .. table:: AMDGPU LLVM IR Attributes + :name: amdgpu-llvm-ir-attributes-table + + ======================================= ========================================================== + LLVM Attribute Description + ======================================= ========================================================== + "amdgpu-flat-work-group-size"="min,max" Specify the minimum and maximum flat work group sizes that + will be specified when the kernel is dispatched. Generated + by the ``amdgpu_flat_work_group_size`` CLANG attribute [CLANG-ATTR]_. + "amdgpu-implicitarg-num-bytes"="n" Number of kernel argument bytes to add to the kernel + argument block size for the implicit arguments. This + varies by OS and language (for OpenCL see + :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`). + "amdgpu-num-sgpr"="n" Specifies the number of SGPRs to use. Generated by + the ``amdgpu_num_sgpr`` CLANG attribute [CLANG-ATTR]_. + "amdgpu-num-vgpr"="n" Specifies the number of VGPRs to use. Generated by the + ``amdgpu_num_vgpr`` CLANG attribute [CLANG-ATTR]_. + "amdgpu-waves-per-eu"="m,n" Specify the minimum and maximum number of waves per + execution unit. Generated by the ``amdgpu_waves_per_eu`` + CLANG attribute [CLANG-ATTR]_. + "amdgpu-ieee" true/false. Specify whether the function expects the IEEE field of the + mode register to be set on entry. Overrides the default for + the calling convention. + "amdgpu-dx10-clamp" true/false. Specify whether the function expects the DX10_CLAMP field of + the mode register to be set on entry. Overrides the default + for the calling convention. + ======================================= ========================================================== + +.. _amdgpu-elf-code-object: + +ELF Code Object +=============== + +The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that +can be linked by ``lld`` to produce a standard ELF shared code object which can +be loaded and executed on an AMDGPU target. + +.. _amdgpu-elf-header: + +Header +------ + +The AMDGPU backend uses the following ELF header: + + .. table:: AMDGPU ELF Header + :name: amdgpu-elf-header-table + + ========================== =============================== + Field Value + ========================== =============================== + ``e_ident[EI_CLASS]`` ``ELFCLASS64`` + ``e_ident[EI_DATA]`` ``ELFDATA2LSB`` + ``e_ident[EI_OSABI]`` - ``ELFOSABI_NONE`` + - ``ELFOSABI_AMDGPU_HSA`` + - ``ELFOSABI_AMDGPU_PAL`` + - ``ELFOSABI_AMDGPU_MESA3D`` + ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA`` + - ``ELFABIVERSION_AMDGPU_PAL`` + - ``ELFABIVERSION_AMDGPU_MESA3D`` + ``e_type`` - ``ET_REL`` + - ``ET_DYN`` + ``e_machine`` ``EM_AMDGPU`` + ``e_entry`` 0 + ``e_flags`` See :ref:`amdgpu-elf-header-e_flags-table` + ========================== =============================== + +.. + + .. table:: AMDGPU ELF Header Enumeration Values + :name: amdgpu-elf-header-enumeration-values-table + + =============================== ===== + Name Value + =============================== ===== + ``EM_AMDGPU`` 224 + ``ELFOSABI_NONE`` 0 + ``ELFOSABI_AMDGPU_HSA`` 64 + ``ELFOSABI_AMDGPU_PAL`` 65 + ``ELFOSABI_AMDGPU_MESA3D`` 66 + ``ELFABIVERSION_AMDGPU_HSA`` 1 + ``ELFABIVERSION_AMDGPU_PAL`` 0 + ``ELFABIVERSION_AMDGPU_MESA3D`` 0 + =============================== ===== + +``e_ident[EI_CLASS]`` + The ELF class is: + + * ``ELFCLASS32`` for ``r600`` architecture. + + * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64-bit + process address space applications. + +``e_ident[EI_DATA]`` + All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering. + +``e_ident[EI_OSABI]`` + One of the following AMDGPU target architecture specific OS ABIs + (see :ref:`amdgpu-os-table`): + + * ``ELFOSABI_NONE`` for *unknown* OS. + + * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS. + + * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS. + + * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3D`` OS. + +``e_ident[EI_ABIVERSION]`` + The ABI version of the AMDGPU target architecture specific OS ABI to which the code + object conforms: + + * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA + runtime ABI. + + * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL + runtime ABI. + + * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA + 3D runtime ABI. + +``e_type`` + Can be one of the following values: + + + ``ET_REL`` + The type produced by the AMDGPU backend compiler as it is relocatable code + object. + + ``ET_DYN`` + The type produced by the linker as it is a shared code object. + + The AMD HSA runtime loader requires a ``ET_DYN`` code object. + +``e_machine`` + The value ``EM_AMDGPU`` is used for the machine for all processors supported + by the ``r600`` and ``amdgcn`` architectures (see + :ref:`amdgpu-processor-table`). The specific processor is specified in the + ``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see + :ref:`amdgpu-elf-header-e_flags-table`). + +``e_entry`` + The entry point is 0 as the entry points for individual kernels must be + selected in order to invoke them through AQL packets. + +``e_flags`` + The AMDGPU backend uses the following ELF header flags: + + .. table:: AMDGPU ELF Header ``e_flags`` + :name: amdgpu-elf-header-e_flags-table + + ================================= ========== ============================= + Name Value Description + ================================= ========== ============================= + **AMDGPU Processor Flag** See :ref:`amdgpu-processor-table`. + -------------------------------------------- ----------------------------- + ``EF_AMDGPU_MACH`` 0x000000ff AMDGPU processor selection + mask for + ``EF_AMDGPU_MACH_xxx`` values + defined in + :ref:`amdgpu-ef-amdgpu-mach-table`. + ``EF_AMDGPU_XNACK`` 0x00000100 Indicates if the ``xnack`` + target feature is + enabled for all code + contained in the code object. + If the processor + does not support the + ``xnack`` target + feature then must + be 0. + See + :ref:`amdgpu-target-features`. + ``EF_AMDGPU_SRAM_ECC`` 0x00000200 Indicates if the ``sram-ecc`` + target feature is + enabled for all code + contained in the code object. + If the processor + does not support the + ``sram-ecc`` target + feature then must + be 0. + See + :ref:`amdgpu-target-features`. + ================================= ========== ============================= + + .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values + :name: amdgpu-ef-amdgpu-mach-table + + ================================= ========== ============================= + Name Value Description (see + :ref:`amdgpu-processor-table`) + ================================= ========== ============================= + ``EF_AMDGPU_MACH_NONE`` 0x000 *not specified* + ``EF_AMDGPU_MACH_R600_R600`` 0x001 ``r600`` + ``EF_AMDGPU_MACH_R600_R630`` 0x002 ``r630`` + ``EF_AMDGPU_MACH_R600_RS880`` 0x003 ``rs880`` + ``EF_AMDGPU_MACH_R600_RV670`` 0x004 ``rv670`` + ``EF_AMDGPU_MACH_R600_RV710`` 0x005 ``rv710`` + ``EF_AMDGPU_MACH_R600_RV730`` 0x006 ``rv730`` + ``EF_AMDGPU_MACH_R600_RV770`` 0x007 ``rv770`` + ``EF_AMDGPU_MACH_R600_CEDAR`` 0x008 ``cedar`` + ``EF_AMDGPU_MACH_R600_CYPRESS`` 0x009 ``cypress`` + ``EF_AMDGPU_MACH_R600_JUNIPER`` 0x00a ``juniper`` + ``EF_AMDGPU_MACH_R600_REDWOOD`` 0x00b ``redwood`` + ``EF_AMDGPU_MACH_R600_SUMO`` 0x00c ``sumo`` + ``EF_AMDGPU_MACH_R600_BARTS`` 0x00d ``barts`` + ``EF_AMDGPU_MACH_R600_CAICOS`` 0x00e ``caicos`` + ``EF_AMDGPU_MACH_R600_CAYMAN`` 0x00f ``cayman`` + ``EF_AMDGPU_MACH_R600_TURKS`` 0x010 ``turks`` + *reserved* 0x011 - Reserved for ``r600`` + 0x01f architecture processors. + ``EF_AMDGPU_MACH_AMDGCN_GFX600`` 0x020 ``gfx600`` + ``EF_AMDGPU_MACH_AMDGCN_GFX601`` 0x021 ``gfx601`` + ``EF_AMDGPU_MACH_AMDGCN_GFX700`` 0x022 ``gfx700`` + ``EF_AMDGPU_MACH_AMDGCN_GFX701`` 0x023 ``gfx701`` + ``EF_AMDGPU_MACH_AMDGCN_GFX702`` 0x024 ``gfx702`` + ``EF_AMDGPU_MACH_AMDGCN_GFX703`` 0x025 ``gfx703`` + ``EF_AMDGPU_MACH_AMDGCN_GFX704`` 0x026 ``gfx704`` + *reserved* 0x027 Reserved. + ``EF_AMDGPU_MACH_AMDGCN_GFX801`` 0x028 ``gfx801`` + ``EF_AMDGPU_MACH_AMDGCN_GFX802`` 0x029 ``gfx802`` + ``EF_AMDGPU_MACH_AMDGCN_GFX803`` 0x02a ``gfx803`` + ``EF_AMDGPU_MACH_AMDGCN_GFX810`` 0x02b ``gfx810`` + ``EF_AMDGPU_MACH_AMDGCN_GFX900`` 0x02c ``gfx900`` + ``EF_AMDGPU_MACH_AMDGCN_GFX902`` 0x02d ``gfx902`` + ``EF_AMDGPU_MACH_AMDGCN_GFX904`` 0x02e ``gfx904`` + ``EF_AMDGPU_MACH_AMDGCN_GFX906`` 0x02f ``gfx906`` + ``EF_AMDGPU_MACH_AMDGCN_GFX908`` 0x030 ``gfx908`` + ``EF_AMDGPU_MACH_AMDGCN_GFX909`` 0x031 ``gfx909`` + *reserved* 0x032 Reserved. + ``EF_AMDGPU_MACH_AMDGCN_GFX1010`` 0x033 ``gfx1010`` + ``EF_AMDGPU_MACH_AMDGCN_GFX1011`` 0x034 ``gfx1011`` + ``EF_AMDGPU_MACH_AMDGCN_GFX1012`` 0x035 ``gfx1012`` + ``EF_AMDGPU_MACH_AMDGCN_GFX1030`` 0x036 ``gfx1030`` + ``EF_AMDGPU_MACH_AMDGCN_GFX1031`` 0x037 ``gfx1031`` + ================================= ========== ============================= + +Sections +-------- + +An AMDGPU target ELF code object has the standard ELF sections which include: + + .. table:: AMDGPU ELF Sections + :name: amdgpu-elf-sections-table + + ================== ================ ================================= + Name Type Attributes + ================== ================ ================================= + ``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE`` + ``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE`` + ``.debug_``\ *\** ``SHT_PROGBITS`` *none* + ``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC`` + ``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE`` + ``.hash`` ``SHT_HASH`` ``SHF_ALLOC`` + ``.note`` ``SHT_NOTE`` *none* + ``.rela``\ *name* ``SHT_RELA`` *none* + ``.rela.dyn`` ``SHT_RELA`` *none* + ``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``.shstrtab`` ``SHT_STRTAB`` *none* + ``.strtab`` ``SHT_STRTAB`` *none* + ``.symtab`` ``SHT_SYMTAB`` *none* + ``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR`` + ================== ================ ================================= + +These sections have their standard meanings (see [ELF]_) and are only generated +if needed. + +``.debug``\ *\** + The standard DWARF sections. See :ref:`amdgpu-dwarf-debug-information` for + information on the DWARF produced by the AMDGPU backend. + +``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash`` + The standard sections used by a dynamic loader. + +``.note`` + See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU + backend. + +``.rela``\ *name*, ``.rela.dyn`` + For relocatable code objects, *name* is the name of the section that the + relocation records apply. For example, ``.rela.text`` is the section name for + relocation records associated with the ``.text`` section. + + For linked shared code objects, ``.rela.dyn`` contains all the relocation + records from each of the relocatable code object's ``.rela``\ *name* sections. + + See :ref:`amdgpu-relocation-records` for the relocation records supported by + the AMDGPU backend. + +``.text`` + The executable machine code for the kernels and functions they call. Generated + as position independent code. See :ref:`amdgpu-code-conventions` for + information on conventions used in the isa generation. + +.. _amdgpu-note-records: + +Note Records +------------ + +The AMDGPU backend code object contains ELF note records in the ``.note`` +section. The set of generated notes and their semantics depend on the code +object version; see :ref:`amdgpu-note-records-v2` and +:ref:`amdgpu-note-records-v3`. + +As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero-byte padding +must be generated after the ``name`` field to ensure the ``desc`` field is 4 +byte aligned. In addition, minimal zero-byte padding must be generated to +ensure the ``desc`` field size is a multiple of 4 bytes. The ``sh_addralign`` +field of the ``.note`` section must be at least 4 to indicate at least 8 byte +alignment. + +.. _amdgpu-note-records-v2: + +Code Object V2 Note Records (-mattr=-code-object-v3) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. warning:: Code Object V2 is not the default code object version emitted by + this version of LLVM. For a description of the notes generated with the + default configuration (Code Object V3) see :ref:`amdgpu-note-records-v3`. + +The AMDGPU backend code object uses the following ELF note record in the +``.note`` section when compiling for Code Object V2 (-mattr=-code-object-v3). + +Additional note records may be present, but any which are not documented here +are deprecated and should not be used. + + .. table:: AMDGPU Code Object V2 ELF Note Records + :name: amdgpu-elf-note-records-table-v2 + + ===== ============================== ====================================== + Name Type Description + ===== ============================== ====================================== + "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` + ===== ============================== ====================================== + +.. + + .. table:: AMDGPU Code Object V2 ELF Note Record Enumeration Values + :name: amdgpu-elf-note-record-enumeration-values-table-v2 + + ============================== ===== + Name Value + ============================== ===== + *reserved* 0-9 + ``NT_AMD_AMDGPU_HSA_METADATA`` 10 + *reserved* 11 + ============================== ===== + +``NT_AMD_AMDGPU_HSA_METADATA`` + Specifies extensible metadata associated with the code objects executed on HSA + [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when + the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See + :ref:`amdgpu-amdhsa-code-object-metadata-v2` for the syntax of the code + object metadata string. + +.. _amdgpu-note-records-v3: + +Code Object V3 Note Records (-mattr=+code-object-v3) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The AMDGPU backend code object uses the following ELF note record in the +``.note`` section when compiling for Code Object V3 (-mattr=+code-object-v3). + +Additional note records may be present, but any which are not documented here +are deprecated and should not be used. + + .. table:: AMDGPU Code Object V3 ELF Note Records + :name: amdgpu-elf-note-records-table-v3 + + ======== ============================== ====================================== + Name Type Description + ======== ============================== ====================================== + "AMDGPU" ``NT_AMDGPU_METADATA`` Metadata in Message Pack [MsgPack]_ + binary format. + ======== ============================== ====================================== + +.. + + .. table:: AMDGPU Code Object V3 ELF Note Record Enumeration Values + :name: amdgpu-elf-note-record-enumeration-values-table-v3 + + ============================== ===== + Name Value + ============================== ===== + *reserved* 0-31 + ``NT_AMDGPU_METADATA`` 32 + ============================== ===== + +``NT_AMDGPU_METADATA`` + Specifies extensible metadata associated with an AMDGPU code + object. It is encoded as a map in the Message Pack [MsgPack]_ binary + data format. See :ref:`amdgpu-amdhsa-code-object-metadata-v3` for the + map keys defined for the ``amdhsa`` OS. + +.. _amdgpu-symbols: + +Symbols +------- + +Symbols include the following: + + .. table:: AMDGPU ELF Symbols + :name: amdgpu-elf-symbols-table + + ===================== ================== ================ ================== + Name Type Section Description + ===================== ================== ================ ================== + *link-name* ``STT_OBJECT`` - ``.data`` Global variable + - ``.rodata`` + - ``.bss`` + *link-name*\ ``.kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor + *link-name* ``STT_FUNC`` - ``.text`` Kernel entry point + *link-name* ``STT_OBJECT`` - SHN_AMDGPU_LDS Global variable in LDS + ===================== ================== ================ ================== + +Global variable + Global variables both used and defined by the compilation unit. + + If the symbol is defined in the compilation unit then it is allocated in the + appropriate section according to if it has initialized data or is readonly. + + If the symbol is external then its section is ``STN_UNDEF`` and the loader + will resolve relocations using the definition provided by another code object + or explicitly defined by the runtime. + + If the symbol resides in local/group memory (LDS) then its section is the + special processor specific section name ``SHN_AMDGPU_LDS``, and the + ``st_value`` field describes alignment requirements as it does for common + symbols. + + .. TODO:: + + Add description of linked shared object symbols. Seems undefined symbols + are marked as STT_NOTYPE. + +Kernel descriptor + Every HSA kernel has an associated kernel descriptor. It is the address of the + kernel descriptor that is used in the AQL dispatch packet used to invoke the + kernel, not the kernel entry point. The layout of the HSA kernel descriptor is + defined in :ref:`amdgpu-amdhsa-kernel-descriptor`. + +Kernel entry point + Every HSA kernel also has a symbol for its machine code entry point. + +.. _amdgpu-relocation-records: + +Relocation Records +------------------ + +AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported +relocatable fields are: + +``word32`` + This specifies a 32-bit field occupying 4 bytes with arbitrary byte + alignment. These values use the same byte order as other word values in the + AMDGPU architecture. + +``word64`` + This specifies a 64-bit field occupying 8 bytes with arbitrary byte + alignment. These values use the same byte order as other word values in the + AMDGPU architecture. + +Following notations are used for specifying relocation calculations: + +**A** + Represents the addend used to compute the value of the relocatable field. + +**G** + Represents the offset into the global offset table at which the relocation + entry's symbol will reside during execution. + +**GOT** + Represents the address of the global offset table. + +**P** + Represents the place (section offset for ``et_rel`` or address for ``et_dyn``) + of the storage unit being relocated (computed using ``r_offset``). + +**S** + Represents the value of the symbol whose index resides in the relocation + entry. Relocations not using this must specify a symbol index of + ``STN_UNDEF``. + +**B** + Represents the base address of a loaded executable or shared object which is + the difference between the ELF address and the actual load address. + Relocations using this are only valid in executable or shared objects. + +The following relocation types are supported: + + .. table:: AMDGPU ELF Relocation Records + :name: amdgpu-elf-relocation-records-table + + ========================== ======= ===== ========== ============================== + Relocation Type Kind Value Field Calculation + ========================== ======= ===== ========== ============================== + ``R_AMDGPU_NONE`` 0 *none* *none* + ``R_AMDGPU_ABS32_LO`` Static, 1 ``word32`` (S + A) & 0xFFFFFFFF + Dynamic + ``R_AMDGPU_ABS32_HI`` Static, 2 ``word32`` (S + A) >> 32 + Dynamic + ``R_AMDGPU_ABS64`` Static, 3 ``word64`` S + A + Dynamic + ``R_AMDGPU_REL32`` Static 4 ``word32`` S + A - P + ``R_AMDGPU_REL64`` Static 5 ``word64`` S + A - P + ``R_AMDGPU_ABS32`` Static, 6 ``word32`` S + A + Dynamic + ``R_AMDGPU_GOTPCREL`` Static 7 ``word32`` G + GOT + A - P + ``R_AMDGPU_GOTPCREL32_LO`` Static 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF + ``R_AMDGPU_GOTPCREL32_HI`` Static 9 ``word32`` (G + GOT + A - P) >> 32 + ``R_AMDGPU_REL32_LO`` Static 10 ``word32`` (S + A - P) & 0xFFFFFFFF + ``R_AMDGPU_REL32_HI`` Static 11 ``word32`` (S + A - P) >> 32 + *reserved* 12 + ``R_AMDGPU_RELATIVE64`` Dynamic 13 ``word64`` B + A + ========================== ======= ===== ========== ============================== + +``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by +the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``. + +There is no current OS loader support for 32-bit programs and so +``R_AMDGPU_ABS32`` is not used. + +.. _amdgpu-loaded-code-object-path-uniform-resource-identifier: + +Loaded Code Object Path Uniform Resource Identifier (URI) +--------------------------------------------------------- + +The AMD GPU code object loader represents the path of the ELF shared object from +which the code object was loaded as a textual Unifom Resource Identifier (URI). +Note that the code object is the in memory loaded relocated form of the ELF +shared object. Multiple code objects may be loaded at different memory +addresses in the same process from the same ELF shared object. + +The loaded code object path URI syntax is defined by the following BNF syntax: + +.. code:: + + code_object_uri ::== file_uri | memory_uri + file_uri ::== "file://" file_path [ range_specifier ] + memory_uri ::== "memory://" process_id range_specifier + range_specifier ::== [ "#" | "?" ] "offset=" number "&" "size=" number + file_path ::== URI_ENCODED_OS_FILE_PATH + process_id ::== DECIMAL_NUMBER + number ::== HEX_NUMBER | DECIMAL_NUMBER | OCTAL_NUMBER + +**number** + Is a C integral literal where hexadecimal values are prefixed by "0x" or "0X", + and octal values by "0". + +**file_path** + Is the file's path specified as a URI encoded UTF-8 string. In URI encoding, + every character that is not in the regular expression ``[a-zA-Z0-9/_.~-]`` is + encoded as two uppercase hexadecimal digits proceeded by "%". Directories in + the path are separated by "/". + +**offset** + Is a 0-based byte offset to the start of the code object. For a file URI, it + is from the start of the file specified by the ``file_path``, and if omitted + defaults to 0. For a memory URI, it is the memory address and is required. + +**size** + Is the number of bytes in the code object. For a file URI, if omitted it + defaults to the size of the file. It is required for a memory URI. + +**process_id** + Is the identity of the process owning the memory. For Linux it is the C + unsigned integral decimal literal for the process ID (PID). + +For example: + +.. code:: + + file:///dir1/dir2/file1 + file:///dir3/dir4/file2#offset=0x2000&size=3000 + memory://1234#offset=0x20000&size=3000 + +.. _amdgpu-dwarf-debug-information: + +DWARF Debug Information +======================= + +.. warning:: + + This section describes **provisional support** for AMDGPU DWARF [DWARF]_ that + is not currently fully implemented and is subject to change. + +AMDGPU generates DWARF [DWARF]_ debugging information ELF sections (see +:ref:`amdgpu-elf-code-object`) which contain information that maps the code +object executable code and data to the source language constructs. It can be +used by tools such as debuggers and profilers. It uses features defined in +:doc:`AMDGPUDwarfExtensionsForHeterogeneousDebugging` that are made available in +DWARF Version 4 and DWARF Version 5 as an LLVM vendor extension. + +This section defines the AMDGPU target architecture specific DWARF mappings. + +.. _amdgpu-dwarf-register-identifier: + +Register Identifier +------------------- + +This section defines the AMDGPU target architecture register numbers used in +DWARF operation expressions (see DWARF Version 5 section 2.5 and +:ref:`amdgpu-dwarf-operation-expressions`) and Call Frame Information +instructions (see DWARF Version 5 section 6.4 and +:ref:`amdgpu-dwarf-call-frame-information`). + +A single code object can contain code for kernels that have different wavefront +sizes. The vector registers and some scalar registers are based on the wavefront +size. AMDGPU defines distinct DWARF registers for each wavefront size. This +simplifies the consumer of the DWARF so that each register has a fixed size, +rather than being dynamic according to the wavefront size mode. Similarly, +distinct DWARF registers are defined for those registers that vary in size +according to the process address size. This allows a consumer to treat a +specific AMDGPU processor as a single architecture regardless of how it is +configured at run time. The compiler explicitly specifies the DWARF registers +that match the mode in which the code it is generating will be executed. + +DWARF registers are encoded as numbers, which are mapped to architecture +registers. The mapping for AMDGPU is defined in +:ref:`amdgpu-dwarf-register-mapping-table`. All AMDGPU targets use the same +mapping. + +.. table:: AMDGPU DWARF Register Mapping + :name: amdgpu-dwarf-register-mapping-table + + ============== ================= ======== ================================== + DWARF Register AMDGPU Register Bit Size Description + ============== ================= ======== ================================== + 0 PC_32 32 Program Counter (PC) when + executing in a 32-bit process + address space. Used in the CFI to + describe the PC of the calling + frame. + 1 EXEC_MASK_32 32 Execution Mask Register when + executing in wavefront 32 mode. + 2-15 *Reserved* *Reserved for highly accessed + registers using DWARF shortcut.* + 16 PC_64 64 Program Counter (PC) when + executing in a 64-bit process + address space. Used in the CFI to + describe the PC of the calling + frame. + 17 EXEC_MASK_64 64 Execution Mask Register when + executing in wavefront 64 mode. + 18-31 *Reserved* *Reserved for highly accessed + registers using DWARF shortcut.* + 32-95 SGPR0-SGPR63 32 Scalar General Purpose + Registers. + 96-127 *Reserved* *Reserved for frequently accessed + registers using DWARF 1-byte ULEB.* + 128 SCC 32 Scalar Condition Code Register. + 129-511 *Reserved* *Reserved for future Scalar + Architectural Registers.* + 512 VCC_32 32 Vector Condition Code Register + when executing in wavefront 32 + mode. + 513-1023 *Reserved* *Reserved for future Vector + Architectural Registers when + executing in wavefront 32 mode.* + 768 VCC_64 32 Vector Condition Code Register + when executing in wavefront 64 + mode. + 769-1023 *Reserved* *Reserved for future Vector + Architectural Registers when + executing in wavefront 64 mode.* + 1024-1087 *Reserved* *Reserved for padding.* + 1088-1129 SGPR64-SGPR105 32 Scalar General Purpose Registers. + 1130-1535 *Reserved* *Reserved for future Scalar + General Purpose Registers.* + 1536-1791 VGPR0-VGPR255 32*32 Vector General Purpose Registers + when executing in wavefront 32 + mode. + 1792-2047 *Reserved* *Reserved for future Vector + General Purpose Registers when + executing in wavefront 32 mode.* + 2048-2303 AGPR0-AGPR255 32*32 Vector Accumulation Registers + when executing in wavefront 32 + mode. + 2304-2559 *Reserved* *Reserved for future Vector + Accumulation Registers when + executing in wavefront 32 mode.* + 2560-2815 VGPR0-VGPR255 64*32 Vector General Purpose Registers + when executing in wavefront 64 + mode. + 2816-3071 *Reserved* *Reserved for future Vector + General Purpose Registers when + executing in wavefront 64 mode.* + 3072-3327 AGPR0-AGPR255 64*32 Vector Accumulation Registers + when executing in wavefront 64 + mode. + 3328-3583 *Reserved* *Reserved for future Vector + Accumulation Registers when + executing in wavefront 64 mode.* + ============== ================= ======== ================================== + +The vector registers are represented as the full size for the wavefront. They +are organized as consecutive dwords (32-bits), one per lane, with the dword at +the least significant bit position corresponding to lane 0 and so forth. DWARF +location expressions involving the ``DW_OP_LLVM_offset`` and +``DW_OP_LLVM_push_lane`` operations are used to select the part of the vector +register corresponding to the lane that is executing the current thread of +execution in languages that are implemented using a SIMD or SIMT execution +model. + +If the wavefront size is 32 lanes then the wavefront 32 mode register +definitions are used. If the wavefront size is 64 lanes then the wavefront 64 +mode register definitions are used. Some AMDGPU targets support executing in +both wavefront 32 and wavefront 64 mode. The register definitions corresponding +to the wavefront mode of the generated code will be used. + +If code is generated to execute in a 32-bit process address space, then the +32-bit process address space register definitions are used. If code is generated +to execute in a 64-bit process address space, then the 64-bit process address +space register definitions are used. The ``amdgcn`` target only supports the +64-bit process address space. + +.. _amdgpu-dwarf-address-class-identifier: + +Address Class Identifier +------------------------ + +The DWARF address class represents the source language memory space. See DWARF +Version 5 section 2.12 which is updated by the *DWARF Extensions For +Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses`. + +The DWARF address class mapping used for AMDGPU is defined in +:ref:`amdgpu-dwarf-address-class-mapping-table`. + +.. table:: AMDGPU DWARF Address Class Mapping + :name: amdgpu-dwarf-address-class-mapping-table + + ========================= ====== ================= + DWARF AMDGPU + -------------------------------- ----------------- + Address Class Name Value Address Space + ========================= ====== ================= + ``DW_ADDR_none`` 0x0000 Generic (Flat) + ``DW_ADDR_LLVM_global`` 0x0001 Global + ``DW_ADDR_LLVM_constant`` 0x0002 Global + ``DW_ADDR_LLVM_group`` 0x0003 Local (group/LDS) + ``DW_ADDR_LLVM_private`` 0x0004 Private (Scratch) + ``DW_ADDR_AMDGPU_region`` 0x8000 Region (GDS) + ========================= ====== ================= + +The DWARF address class values defined in the *DWARF Extensions For +Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses` are used. + +In addition, ``DW_ADDR_AMDGPU_region`` is encoded as a vendor extension. This is +available for use for the AMD extension for access to the hardware GDS memory +which is scratchpad memory allocated per device. + +For AMDGPU if no ``DW_AT_address_class`` attribute is present, then the default +address class of ``DW_ADDR_none`` is used. + +See :ref:`amdgpu-dwarf-address-space-identifier` for information on the AMDGPU +mapping of DWARF address classes to DWARF address spaces, including address size +and NULL value. + +.. _amdgpu-dwarf-address-space-identifier: + +Address Space Identifier +------------------------ + +DWARF address spaces correspond to target architecture specific linear +addressable memory areas. See DWARF Version 5 section 2.12 and *DWARF Extensions +For Heterogeneous Debugging* section :ref:`amdgpu-dwarf-segment_addresses`. + +The DWARF address space mapping used for AMDGPU is defined in +:ref:`amdgpu-dwarf-address-space-mapping-table`. + +.. table:: AMDGPU DWARF Address Space Mapping + :name: amdgpu-dwarf-address-space-mapping-table + + ======================================= ===== ======= ======== ================= ======================= + DWARF AMDGPU Notes + --------------------------------------- ----- ---------------- ----------------- ----------------------- + Address Space Name Value Address Bit Size Address Space + --------------------------------------- ----- ------- -------- ----------------- ----------------------- + .. 64-bit 32-bit + process process + address address + space space + ======================================= ===== ======= ======== ================= ======================= + ``DW_ASPACE_none`` 0x00 8 4 Global *default address space* + ``DW_ASPACE_AMDGPU_generic`` 0x01 8 4 Generic (Flat) + ``DW_ASPACE_AMDGPU_region`` 0x02 4 4 Region (GDS) + ``DW_ASPACE_AMDGPU_local`` 0x03 4 4 Local (group/LDS) + *Reserved* 0x04 + ``DW_ASPACE_AMDGPU_private_lane`` 0x05 4 4 Private (Scratch) *focused lane* + ``DW_ASPACE_AMDGPU_private_wave`` 0x06 4 4 Private (Scratch) *unswizzled wavefront* + ======================================= ===== ======= ======== ================= ======================= + +See :ref:`amdgpu-address-spaces` for information on the AMDGPU address spaces +including address size and NULL value. + +The ``DW_ASPACE_none`` address space is the default target architecture address +space used in DWARF operations that do not specify an address space. It +therefore has to map to the global address space so that the ``DW_OP_addr*`` and +related operations can refer to addresses in the program code. + +The ``DW_ASPACE_AMDGPU_generic`` address space allows location expressions to +specify the flat address space. If the address corresponds to an address in the +local address space, then it corresponds to the wavefront that is executing the +focused thread of execution. If the address corresponds to an address in the +private address space, then it corresponds to the lane that is executing the +focused thread of execution for languages that are implemented using a SIMD or +SIMT execution model. + +.. note:: + + CUDA-like languages such as HIP that do not have address spaces in the + language type system, but do allow variables to be allocated in different + address spaces, need to explicitly specify the ``DW_ASPACE_AMDGPU_generic`` + address space in the DWARF expression operations as the default address space + is the global address space. + +The ``DW_ASPACE_AMDGPU_local`` address space allows location expressions to +specify the local address space corresponding to the wavefront that is executing +the focused thread of execution. + +The ``DW_ASPACE_AMDGPU_private_lane`` address space allows location expressions +to specify the private address space corresponding to the lane that is executing +the focused thread of execution for languages that are implemented using a SIMD +or SIMT execution model. + +The ``DW_ASPACE_AMDGPU_private_wave`` address space allows location expressions +to specify the unswizzled private address space corresponding to the wavefront +that is executing the focused thread of execution. The wavefront view of private +memory is the per wavefront unswizzled backing memory layout defined in +:ref:`amdgpu-address-spaces`, such that address 0 corresponds to the first +location for the backing memory of the wavefront (namely the address is not +offset by ``wavefront-scratch-base``). The following formula can be used to +convert from a ``DW_ASPACE_AMDGPU_private_lane`` address to a +``DW_ASPACE_AMDGPU_private_wave`` address: + +:: + + private-address-wavefront = + ((private-address-lane / 4) * wavefront-size * 4) + + (wavefront-lane-id * 4) + (private-address-lane % 4) + +If the ``DW_ASPACE_AMDGPU_private_lane`` address is dword aligned, and the start +of the dwords for each lane starting with lane 0 is required, then this +simplifies to: + +:: + + private-address-wavefront = + private-address-lane * wavefront-size + +A compiler can use the ``DW_ASPACE_AMDGPU_private_wave`` address space to read a +complete spilled vector register back into a complete vector register in the +CFI. The frame pointer can be a private lane address which is dword aligned, +which can be shifted to multiply by the wavefront size, and then used to form a +private wavefront address that gives a location for a contiguous set of dwords, +one per lane, where the vector register dwords are spilled. The compiler knows +the wavefront size since it generates the code. Note that the type of the +address may have to be converted as the size of a +``DW_ASPACE_AMDGPU_private_lane`` address may be smaller than the size of a +``DW_ASPACE_AMDGPU_private_wave`` address. + +.. _amdgpu-dwarf-lane-identifier: + +Lane identifier +--------------- + +DWARF lane identifies specify a target architecture lane position for hardware +that executes in a SIMD or SIMT manner, and on which a source language maps its +threads of execution onto those lanes. The DWARF lane identifier is pushed by +the ``DW_OP_LLVM_push_lane`` DWARF expression operation. See DWARF Version 5 +section 2.5 which is updated by *DWARF Extensions For Heterogeneous Debugging* +section :ref:`amdgpu-dwarf-operation-expressions`. + +For AMDGPU, the lane identifier corresponds to the hardware lane ID of a +wavefront. It is numbered from 0 to the wavefront size minus 1. + +Operation Expressions +--------------------- + +DWARF expressions are used to compute program values and the locations of +program objects. See DWARF Version 5 section 2.5 and +:ref:`amdgpu-dwarf-operation-expressions`. + +DWARF location descriptions describe how to access storage which includes memory +and registers. When accessing storage on AMDGPU, bytes are ordered with least +significant bytes first, and bits are ordered within bytes with least +significant bits first. + +For AMDGPU CFI expressions, ``DW_OP_LLVM_select_bit_piece`` is used to describe +unwinding vector registers that are spilled under the execution mask to memory: +the zero-single location description is the vector register, and the one-single +location description is the spilled memory location description. The +``DW_OP_LLVM_form_aspace_address`` is used to specify the address space of the +memory location description. + +In AMDGPU expressions, ``DW_OP_LLVM_select_bit_piece`` is used by the +``DW_AT_LLVM_lane_pc`` attribute expression where divergent control flow is +controlled by the execution mask. An undefined location description together +with ``DW_OP_LLVM_extend`` is used to indicate the lane was not active on entry +to the subprogram. See :ref:`amdgpu-dwarf-dw-at-llvm-lane-pc` for an example. + +Debugger Information Entry Attributes +------------------------------------- + +This section describes how certain debugger information entry attributes are +used by AMDGPU. See the sections in DWARF Version 5 section 2 which are updated +by *DWARF Extensions For Heterogeneous Debugging* section +:ref:`amdgpu-dwarf-debugging-information-entry-attributes`. + +.. _amdgpu-dwarf-dw-at-llvm-lane-pc: + +``DW_AT_LLVM_lane_pc`` +~~~~~~~~~~~~~~~~~~~~~~ + +For AMDGPU, the ``DW_AT_LLVM_lane_pc`` attribute is used to specify the program +location of the separate lanes of a SIMT thread. + +If the lane is an active lane then this will be the same as the current program +location. + +If the lane is inactive, but was active on entry to the subprogram, then this is +the program location in the subprogram at which execution of the lane is +conceptual positioned. + +If the lane was not active on entry to the subprogram, then this will be the +undefined location. A client debugger can check if the lane is part of a valid +work-group by checking that the lane is in the range of the associated +work-group within the grid, accounting for partial work-groups. If it is not, +then the debugger can omit any information for the lane. Otherwise, the debugger +may repeatedly unwind the stack and inspect the ``DW_AT_LLVM_lane_pc`` of the +calling subprogram until it finds a non-undefined location. Conceptually the +lane only has the call frames that it has a non-undefined +``DW_AT_LLVM_lane_pc``. + +The following example illustrates how the AMDGPU backend can generate a DWARF +location list expression for the nested ``IF/THEN/ELSE`` structures of the +following subprogram pseudo code for a target with 64 lanes per wavefront. + +.. code:: + :number-lines: + + SUBPROGRAM X + BEGIN + a; + IF (c1) THEN + b; + IF (c2) THEN + c; + ELSE + d; + ENDIF + e; + ELSE + f; + ENDIF + g; + END + +The AMDGPU backend may generate the following pseudo LLVM MIR to manipulate the +execution mask (``EXEC``) to linearize the control flow. The condition is +evaluated to make a mask of the lanes for which the condition evaluates to true. +First the ``THEN`` region is executed by setting the ``EXEC`` mask to the +logical ``AND`` of the current ``EXEC`` mask with the condition mask. Then the +``ELSE`` region is executed by negating the ``EXEC`` mask and logical ``AND`` of +the saved ``EXEC`` mask at the start of the region. After the ``IF/THEN/ELSE`` +region the ``EXEC`` mask is restored to the value it had at the beginning of the +region. This is shown below. Other approaches are possible, but the basic +concept is the same. + +.. code:: + :number-lines: + + $lex_start: + a; + %1 = EXEC + %2 = c1 + $lex_1_start: + EXEC = %1 & %2 + $if_1_then: + b; + %3 = EXEC + %4 = c2 + $lex_1_1_start: + EXEC = %3 & %4 + $lex_1_1_then: + c; + EXEC = ~EXEC & %3 + $lex_1_1_else: + d; + EXEC = %3 + $lex_1_1_end: + e; + EXEC = ~EXEC & %1 + $lex_1_else: + f; + EXEC = %1 + $lex_1_end: + g; + $lex_end: + +To create the DWARF location list expression that defines the location +description of a vector of lane program locations, the LLVM MIR ``DBG_VALUE`` +pseudo instruction can be used to annotate the linearized control flow. This can +be done by defining an artificial variable for the lane PC. The DWARF location +list expression created for it is used as the value of the +``DW_AT_LLVM_lane_pc`` attribute on the subprogram's debugger information entry. + +A DWARF procedure is defined for each well nested structured control flow region +which provides the conceptual lane program location for a lane if it is not +active (namely it is divergent). The DWARF operation expression for each region +conceptually inherits the value of the immediately enclosing region and modifies +it according to the semantics of the region. + +For an ``IF/THEN/ELSE`` region the divergent program location is at the start of +the region for the ``THEN`` region since it is executed first. For the ``ELSE`` +region the divergent program location is at the end of the ``IF/THEN/ELSE`` +region since the ``THEN`` region has completed. + +The lane PC artificial variable is assigned at each region transition. It uses +the immediately enclosing region's DWARF procedure to compute the program +location for each lane assuming they are divergent, and then modifies the result +by inserting the current program location for each lane that the ``EXEC`` mask +indicates is active. + +By having separate DWARF procedures for each region, they can be reused to +define the value for any nested region. This reduces the total size of the DWARF +operation expressions. + +The following provides an example using pseudo LLVM MIR. + +.. code:: + :number-lines: + + $lex_start: + DEFINE_DWARF %__uint_64 = DW_TAG_base_type[ + DW_AT_name = "__uint64"; + DW_AT_byte_size = 8; + DW_AT_encoding = DW_ATE_unsigned; + ]; + DEFINE_DWARF %__active_lane_pc = DW_TAG_dwarf_procedure[ + DW_AT_name = "__active_lane_pc"; + DW_AT_location = [ + DW_OP_regx PC; + DW_OP_LLVM_extend 64, 64; + DW_OP_regval_type EXEC, %uint_64; + DW_OP_LLVM_select_bit_piece 64, 64; + ]; + ]; + DEFINE_DWARF %__divergent_lane_pc = DW_TAG_dwarf_procedure[ + DW_AT_name = "__divergent_lane_pc"; + DW_AT_location = [ + DW_OP_LLVM_undefined; + DW_OP_LLVM_extend 64, 64; + ]; + ]; + DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[ + DW_OP_call_ref %__divergent_lane_pc; + DW_OP_call_ref %__active_lane_pc; + ]; + a; + %1 = EXEC; + DBG_VALUE %1, $noreg, %__lex_1_save_exec; + %2 = c1; + $lex_1_start: + EXEC = %1 & %2; + $lex_1_then: + DEFINE_DWARF %__divergent_lane_pc_1_then = DW_TAG_dwarf_procedure[ + DW_AT_name = "__divergent_lane_pc_1_then"; + DW_AT_location = DIExpression[ + DW_OP_call_ref %__divergent_lane_pc; + DW_OP_addrx &lex_1_start; + DW_OP_stack_value; + DW_OP_LLVM_extend 64, 64; + DW_OP_call_ref %__lex_1_save_exec; + DW_OP_deref_type 64, %__uint_64; + DW_OP_LLVM_select_bit_piece 64, 64; + ]; + ]; + DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[ + DW_OP_call_ref %__divergent_lane_pc_1_then; + DW_OP_call_ref %__active_lane_pc; + ]; + b; + %3 = EXEC; + DBG_VALUE %3, %__lex_1_1_save_exec; + %4 = c2; + $lex_1_1_start: + EXEC = %3 & %4; + $lex_1_1_then: + DEFINE_DWARF %__divergent_lane_pc_1_1_then = DW_TAG_dwarf_procedure[ + DW_AT_name = "__divergent_lane_pc_1_1_then"; + DW_AT_location = DIExpression[ + DW_OP_call_ref %__divergent_lane_pc_1_then; + DW_OP_addrx &lex_1_1_start; + DW_OP_stack_value; + DW_OP_LLVM_extend 64, 64; + DW_OP_call_ref %__lex_1_1_save_exec; + DW_OP_deref_type 64, %__uint_64; + DW_OP_LLVM_select_bit_piece 64, 64; + ]; + ]; + DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[ + DW_OP_call_ref %__divergent_lane_pc_1_1_then; + DW_OP_call_ref %__active_lane_pc; + ]; + c; + EXEC = ~EXEC & %3; + $lex_1_1_else: + DEFINE_DWARF %__divergent_lane_pc_1_1_else = DW_TAG_dwarf_procedure[ + DW_AT_name = "__divergent_lane_pc_1_1_else"; + DW_AT_location = DIExpression[ + DW_OP_call_ref %__divergent_lane_pc_1_then; + DW_OP_addrx &lex_1_1_end; + DW_OP_stack_value; + DW_OP_LLVM_extend 64, 64; + DW_OP_call_ref %__lex_1_1_save_exec; + DW_OP_deref_type 64, %__uint_64; + DW_OP_LLVM_select_bit_piece 64, 64; + ]; + ]; + DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[ + DW_OP_call_ref %__divergent_lane_pc_1_1_else; + DW_OP_call_ref %__active_lane_pc; + ]; + d; + EXEC = %3; + $lex_1_1_end: + DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[ + DW_OP_call_ref %__divergent_lane_pc; + DW_OP_call_ref %__active_lane_pc; + ]; + e; + EXEC = ~EXEC & %1; + $lex_1_else: + DEFINE_DWARF %__divergent_lane_pc_1_else = DW_TAG_dwarf_procedure[ + DW_AT_name = "__divergent_lane_pc_1_else"; + DW_AT_location = DIExpression[ + DW_OP_call_ref %__divergent_lane_pc; + DW_OP_addrx &lex_1_end; + DW_OP_stack_value; + DW_OP_LLVM_extend 64, 64; + DW_OP_call_ref %__lex_1_save_exec; + DW_OP_deref_type 64, %__uint_64; + DW_OP_LLVM_select_bit_piece 64, 64; + ]; + ]; + DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc, DIExpression[ + DW_OP_call_ref %__divergent_lane_pc_1_else; + DW_OP_call_ref %__active_lane_pc; + ]; + f; + EXEC = %1; + $lex_1_end: + DBG_VALUE $noreg, $noreg, %DW_AT_LLVM_lane_pc DIExpression[ + DW_OP_call_ref %__divergent_lane_pc; + DW_OP_call_ref %__active_lane_pc; + ]; + g; + $lex_end: + +The DWARF procedure ``%__active_lane_pc`` is used to update the lane pc elements +that are active, with the current program location. + +Artificial variables %__lex_1_save_exec and %__lex_1_1_save_exec are created for +the execution masks saved on entry to a region. Using the ``DBG_VALUE`` pseudo +instruction, location list entries will be created that describe where the +artificial variables are allocated at any given program location. The compiler +may allocate them to registers or spill them to memory. + +The DWARF procedures for each region use the values of the saved execution mask +artificial variables to only update the lanes that are active on entry to the +region. All other lanes retain the value of the enclosing region where they were +last active. If they were not active on entry to the subprogram, then will have +the undefined location description. + +Other structured control flow regions can be handled similarly. For example, +loops would set the divergent program location for the region at the end of the +loop. Any lanes active will be in the loop, and any lanes not active must have +exited the loop. + +An ``IF/THEN/ELSEIF/ELSEIF/...`` region can be treated as a nest of +``IF/THEN/ELSE`` regions. + +The DWARF procedures can use the active lane artificial variable described in +:ref:`amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane` rather than the actual +``EXEC`` mask in order to support whole or quad wavefront mode. + +.. _amdgpu-dwarf-amdgpu-dw-at-llvm-active-lane: + +``DW_AT_LLVM_active_lane`` +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The ``DW_AT_LLVM_active_lane`` attribute on a subprogram debugger information +entry is used to specify the lanes that are conceptually active for a SIMT +thread. + +The execution mask may be modified to implement whole or quad wavefront mode +operations. For example, all lanes may need to temporarily be made active to +execute a whole wavefront operation. Such regions would save the ``EXEC`` mask, +update it to enable the necessary lanes, perform the operations, and then +restore the ``EXEC`` mask from the saved value. While executing the whole +wavefront region, the conceptual execution mask is the saved value, not the +``EXEC`` value. + +This is handled by defining an artificial variable for the active lane mask. The +active lane mask artificial variable would be the actual ``EXEC`` mask for +normal regions, and the saved execution mask for regions where the mask is +temporarily updated. The location list expression created for this artificial +variable is used to define the value of the ``DW_AT_LLVM_active_lane`` +attribute. + +``DW_AT_LLVM_augmentation`` +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +For AMDGPU, the ``DW_AT_LLVM_augmentation`` attribute of a compilation unit +debugger information entry has the following value for the augmentation string: + +:: + + [amdgpu:v0.0] + +The "vX.Y" specifies the major X and minor Y version number of the AMDGPU +extensions used in the DWARF of the compilation unit. The version number +conforms to [SEMVER]_. + +Call Frame Information +---------------------- + +DWARF Call Frame Information (CFI) describes how a consumer can virtually +*unwind* call frames in a running process or core dump. See DWARF Version 5 +section 6.4 and :ref:`amdgpu-dwarf-call-frame-information`. + +For AMDGPU, the Common Information Entry (CIE) fields have the following values: + +1. ``augmentation`` string contains the following null-terminated UTF-8 string: + + :: + + [amd:v0.0] + + The ``vX.Y`` specifies the major X and minor Y version number of the AMDGPU + extensions used in this CIE or to the FDEs that use it. The version number + conforms to [SEMVER]_. + +2. ``address_size`` for the ``Global`` address space is defined in + :ref:`amdgpu-dwarf-address-space-identifier`. + +3. ``segment_selector_size`` is 0 as AMDGPU does not use a segment selector. + +4. ``code_alignment_factor`` is 4 bytes. + + .. TODO:: + + Add to :ref:`amdgpu-processor-table` table. + +5. ``data_alignment_factor`` is 4 bytes. + + .. TODO:: + + Add to :ref:`amdgpu-processor-table` table. + +6. ``return_address_register`` is ``PC_32`` for 32-bit processes and ``PC_64`` + for 64-bit processes defined in :ref:`amdgpu-dwarf-register-identifier`. + +7. ``initial_instructions`` Since a subprogram X with fewer registers can be + called from subprogram Y that has more allocated, X will not change any of + the extra registers as it cannot access them. Therefore, the default rule + for all columns is ``same value``. + +For AMDGPU the register number follows the numbering defined in +:ref:`amdgpu-dwarf-register-identifier`. + +For AMDGPU the instructions are variable size. A consumer can subtract 1 from +the return address to get the address of a byte within the call site +instructions. See DWARF Version 5 section 6.4.4. + +Accelerated Access +------------------ + +See DWARF Version 5 section 6.1. + +Lookup By Name Section Header +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +See DWARF Version 5 section 6.1.1.4.1 and :ref:`amdgpu-dwarf-lookup-by-name`. + +For AMDGPU the lookup by name section header table: + +``augmentation_string_size`` (uword) + + Set to the length of the ``augmentation_string`` value which is always a + multiple of 4. + +``augmentation_string`` (sequence of UTF-8 characters) + + Contains the following UTF-8 string null padded to a multiple of 4 bytes: + + :: + + [amdgpu:v0.0] + + The "vX.Y" specifies the major X and minor Y version number of the AMDGPU + extensions used in the DWARF of this index. The version number conforms to + [SEMVER]_. + + .. note:: + + This is different to the DWARF Version 5 definition that requires the first + 4 characters to be the vendor ID. But this is consistent with the other + augmentation strings and does allow multiple vendor contributions. However, + backwards compatibility may be more desirable. + +Lookup By Address Section Header +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +See DWARF Version 5 section 6.1.2. + +For AMDGPU the lookup by address section header table: + +``address_size`` (ubyte) + + Match the address size for the ``Global`` address space defined in + :ref:`amdgpu-dwarf-address-space-identifier`. + +``segment_selector_size`` (ubyte) + + AMDGPU does not use a segment selector so this is 0. The entries in the + ``.debug_aranges`` do not have a segment selector. + +Line Number Information +----------------------- + +See DWARF Version 5 section 6.2 and :ref:`amdgpu-dwarf-line-number-information`. + +AMDGPU does not use the ``isa`` state machine registers and always sets it to 0. +The instruction set must be obtained from the ELF file header ``e_flags`` field +in the ``EF_AMDGPU_MACH`` bit position (see :ref:`ELF Header +`). See DWARF Version 5 section 6.2.2. + +.. TODO:: + + Should the ``isa`` state machine register be used to indicate if the code is + in wavefront32 or wavefront64 mode? Or used to specify the architecture ISA? + +For AMDGPU the line number program header fields have the following values (see +DWARF Version 5 section 6.2.4): + +``address_size`` (ubyte) + Matches the address size for the ``Global`` address space defined in + :ref:`amdgpu-dwarf-address-space-identifier`. + +``segment_selector_size`` (ubyte) + AMDGPU does not use a segment selector so this is 0. + +``minimum_instruction_length`` (ubyte) + For GFX9-GFX10 this is 4. + +``maximum_operations_per_instruction`` (ubyte) + For GFX9-GFX10 this is 1. + +Source text for online-compiled programs (for example, those compiled by the +OpenCL language runtime) may be embedded into the DWARF Version 5 line table. +See DWARF Version 5 section 6.2.4.1 which is updated by *DWARF Extensions For +Heterogeneous Debugging* section :ref:`DW_LNCT_LLVM_source +`. + +The Clang option used to control source embedding in AMDGPU is defined in +:ref:`amdgpu-clang-debug-options-table`. + + .. table:: AMDGPU Clang Debug Options + :name: amdgpu-clang-debug-options-table + + ==================== ================================================== + Debug Flag Description + ==================== ================================================== + -g[no-]embed-source Enable/disable embedding source text in DWARF + debug sections. Useful for environments where + source cannot be written to disk, such as + when performing online compilation. + ==================== ================================================== + +For example: + +``-gembed-source`` + Enable the embedded source. + +``-gno-embed-source`` + Disable the embedded source. + +32-Bit and 64-Bit DWARF Formats +------------------------------- + +See DWARF Version 5 section 7.4 and +:ref:`amdgpu-dwarf-32-bit-and-64-bit-dwarf-formats`. + +For AMDGPU: + +* For the ``amdgcn`` target architecture only the 64-bit process address space + is supported. + +* The producer can generate either 32-bit or 64-bit DWARF format. LLVM generates + the 32-bit DWARF format. + +Unit Headers +------------ + +For AMDGPU the following values apply for each of the unit headers described in +DWARF Version 5 sections 7.5.1.1, 7.5.1.2, and 7.5.1.3: + +``address_size`` (ubyte) + Matches the address size for the ``Global`` address space defined in + :ref:`amdgpu-dwarf-address-space-identifier`. + +.. _amdgpu-code-conventions: + +Code Conventions +================ + +This section provides code conventions used for each supported target triple OS +(see :ref:`amdgpu-target-triples`). + +AMDHSA +------ + +This section provides code conventions used when the target triple OS is +``amdhsa`` (see :ref:`amdgpu-target-triples`). + +.. _amdgpu-amdhsa-code-object-target-identification: + +Code Object Target Identification +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The AMDHSA OS uses the following syntax to specify the code object +target as a single string: + + ``----`` + +Where: + + - ````, ````, ```` and ```` + are the same as the *Target Triple* (see + :ref:`amdgpu-target-triples`). + + - ```` is the same as the *Processor* (see + :ref:`amdgpu-processors`). + + - ```` is a list of the enabled *Target Features* + (see :ref:`amdgpu-target-features`), each prefixed by a plus, that + apply to *Processor*. The list must be in the same order as listed + in the table :ref:`amdgpu-target-feature-table`. Note that *Target + Features* must be included in the list if they are enabled even if + that is the default for *Processor*. + +For example: + + ``"amdgcn-amd-amdhsa--gfx902+xnack"`` + +.. _amdgpu-amdhsa-code-object-metadata: + +Code Object Metadata +~~~~~~~~~~~~~~~~~~~~ + +The code object metadata specifies extensible metadata associated with the code +objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm +[AMD-ROCm]_. The encoding and semantics of this metadata depends on the code +object version; see :ref:`amdgpu-amdhsa-code-object-metadata-v2` and +:ref:`amdgpu-amdhsa-code-object-metadata-v3`. + +Code object metadata is specified in a note record (see +:ref:`amdgpu-note-records`) and is required when the target triple OS is +``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum +information necessary to support the ROCM kernel queries. For example, the +segment sizes needed in a dispatch packet. In addition, a high-level language +runtime may require other information to be included. For example, the AMD +OpenCL runtime records kernel argument information. + +.. _amdgpu-amdhsa-code-object-metadata-v2: + +Code Object V2 Metadata (-mattr=-code-object-v3) +++++++++++++++++++++++++++++++++++++++++++++++++ + +.. warning:: Code Object V2 is not the default code object version emitted by + this version of LLVM. For a description of the metadata generated with the + default configuration (Code Object V3) see + :ref:`amdgpu-amdhsa-code-object-metadata-v3`. + +Code object V2 metadata is specified by the ``NT_AMD_AMDGPU_METADATA`` note +record (see :ref:`amdgpu-note-records-v2`). + +The metadata is specified as a YAML formatted string (see [YAML]_ and +:doc:`YamlIO`). + +.. TODO:: + + Is the string null terminated? It probably should not if YAML allows it to + contain null characters, otherwise it should be. + +The metadata is represented as a single YAML document comprised of the mapping +defined in table :ref:`amdgpu-amdhsa-code-object-metadata-map-table-v2` and +referenced tables. + +For boolean values, the string values of ``false`` and ``true`` are used for +false and true respectively. + +Additional information can be added to the mappings. To avoid conflicts, any +non-AMD key names should be prefixed by "*vendor-name*.". + + .. table:: AMDHSA Code Object V2 Metadata Map + :name: amdgpu-amdhsa-code-object-metadata-map-table-v2 + + ========== ============== ========= ======================================= + String Key Value Type Required? Description + ========== ============== ========= ======================================= + "Version" sequence of Required - The first integer is the major + 2 integers version. Currently 1. + - The second integer is the minor + version. Currently 0. + "Printf" sequence of Each string is encoded information + strings about a printf function call. The + encoded information is organized as + fields separated by colon (':'): + + ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString`` + + where: + + ``ID`` + A 32-bit integer as a unique id for + each printf function call + + ``N`` + A 32-bit integer equal to the number + of arguments of printf function call + minus 1 + + ``S[i]`` (where i = 0, 1, ... , N-1) + 32-bit integers for the size in bytes + of the i-th FormatString argument of + the printf function call + + FormatString + The format string passed to the + printf function call. + "Kernels" sequence of Required Sequence of the mappings for each + mapping kernel in the code object. See + :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v2` + for the definition of the mapping. + ========== ============== ========= ======================================= + +.. + + .. table:: AMDHSA Code Object V2 Kernel Metadata Map + :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v2 + + ================= ============== ========= ================================ + String Key Value Type Required? Description + ================= ============== ========= ================================ + "Name" string Required Source name of the kernel. + "SymbolName" string Required Name of the kernel + descriptor ELF symbol. + "Language" string Source language of the kernel. + Values include: + + - "OpenCL C" + - "OpenCL C++" + - "HCC" + - "OpenMP" + + "LanguageVersion" sequence of - The first integer is the major + 2 integers version. + - The second integer is the + minor version. + "Attrs" mapping Mapping of kernel attributes. + See + :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-table-v2` + for the mapping definition. + "Args" sequence of Sequence of mappings of the + mapping kernel arguments. See + :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v2` + for the definition of the mapping. + "CodeProps" mapping Mapping of properties related to + the kernel code. See + :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-table-v2` + for the mapping definition. + ================= ============== ========= ================================ + +.. + + .. table:: AMDHSA Code Object V2 Kernel Attribute Metadata Map + :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-map-table-v2 + + =================== ============== ========= ============================== + String Key Value Type Required? Description + =================== ============== ========= ============================== + "ReqdWorkGroupSize" sequence of If not 0, 0, 0 then all values + 3 integers must be >=1 and the dispatch + work-group size X, Y, Z must + correspond to the specified + values. Defaults to 0, 0, 0. + + Corresponds to the OpenCL + ``reqd_work_group_size`` + attribute. + "WorkGroupSizeHint" sequence of The dispatch work-group size + 3 integers X, Y, Z is likely to be the + specified values. + + Corresponds to the OpenCL + ``work_group_size_hint`` + attribute. + "VecTypeHint" string The name of a scalar or vector + type. + + Corresponds to the OpenCL + ``vec_type_hint`` attribute. + + "RuntimeHandle" string The external symbol name + associated with a kernel. + OpenCL runtime allocates a + global buffer for the symbol + and saves the kernel's address + to it, which is used for + device side enqueueing. Only + available for device side + enqueued kernels. + =================== ============== ========= ============================== + +.. + + .. table:: AMDHSA Code Object V2 Kernel Argument Metadata Map + :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v2 + + ================= ============== ========= ================================ + String Key Value Type Required? Description + ================= ============== ========= ================================ + "Name" string Kernel argument name. + "TypeName" string Kernel argument type name. + "Size" integer Required Kernel argument size in bytes. + "Align" integer Required Kernel argument alignment in + bytes. Must be a power of two. + "ValueKind" string Required Kernel argument kind that + specifies how to set up the + corresponding argument. + Values include: + + "ByValue" + The argument is copied + directly into the kernarg. + + "GlobalBuffer" + A global address space pointer + to the buffer data is passed + in the kernarg. + + "DynamicSharedPointer" + A group address space pointer + to dynamically allocated LDS + is passed in the kernarg. + + "Sampler" + A global address space + pointer to a S# is passed in + the kernarg. + + "Image" + A global address space + pointer to a T# is passed in + the kernarg. + + "Pipe" + A global address space pointer + to an OpenCL pipe is passed in + the kernarg. + + "Queue" + A global address space pointer + to an OpenCL device enqueue + queue is passed in the + kernarg. + + "HiddenGlobalOffsetX" + The OpenCL grid dispatch + global offset for the X + dimension is passed in the + kernarg. + + "HiddenGlobalOffsetY" + The OpenCL grid dispatch + global offset for the Y + dimension is passed in the + kernarg. + + "HiddenGlobalOffsetZ" + The OpenCL grid dispatch + global offset for the Z + dimension is passed in the + kernarg. + + "HiddenNone" + An argument that is not used + by the kernel. Space needs to + be left for it, but it does + not need to be set up. + + "HiddenPrintfBuffer" + A global address space pointer + to the runtime printf buffer + is passed in kernarg. + + "HiddenHostcallBuffer" + A global address space pointer + to the runtime hostcall buffer + is passed in kernarg. + + "HiddenDefaultQueue" + A global address space pointer + to the OpenCL device enqueue + queue that should be used by + the kernel by default is + passed in the kernarg. + + "HiddenCompletionAction" + A global address space pointer + to help link enqueued kernels into + the ancestor tree for determining + when the parent kernel has finished. + + "HiddenMultiGridSyncArg" + A global address space pointer for + multi-grid synchronization is + passed in the kernarg. + + "ValueType" string Unused and deprecated. This should no longer + be emitted, but is accepted for compatibility. + + + "PointeeAlign" integer Alignment in bytes of pointee + type for pointer type kernel + argument. Must be a power + of 2. Only present if + "ValueKind" is + "DynamicSharedPointer". + "AddrSpaceQual" string Kernel argument address space + qualifier. Only present if + "ValueKind" is "GlobalBuffer" or + "DynamicSharedPointer". Values + are: + + - "Private" + - "Global" + - "Constant" + - "Local" + - "Generic" + - "Region" + + .. TODO:: + Is GlobalBuffer only Global + or Constant? Is + DynamicSharedPointer always + Local? Can HCC allow Generic? + How can Private or Region + ever happen? + "AccQual" string Kernel argument access + qualifier. Only present if + "ValueKind" is "Image" or + "Pipe". Values + are: + + - "ReadOnly" + - "WriteOnly" + - "ReadWrite" + + .. TODO:: + Does this apply to + GlobalBuffer? + "ActualAccQual" string The actual memory accesses + performed by the kernel on the + kernel argument. Only present if + "ValueKind" is "GlobalBuffer", + "Image", or "Pipe". This may be + more restrictive than indicated + by "AccQual" to reflect what the + kernel actual does. If not + present then the runtime must + assume what is implied by + "AccQual" and "IsConst". Values + are: + + - "ReadOnly" + - "WriteOnly" + - "ReadWrite" + + "IsConst" boolean Indicates if the kernel argument + is const qualified. Only present + if "ValueKind" is + "GlobalBuffer". + + "IsRestrict" boolean Indicates if the kernel argument + is restrict qualified. Only + present if "ValueKind" is + "GlobalBuffer". + + "IsVolatile" boolean Indicates if the kernel argument + is volatile qualified. Only + present if "ValueKind" is + "GlobalBuffer". + + "IsPipe" boolean Indicates if the kernel argument + is pipe qualified. Only present + if "ValueKind" is "Pipe". + + .. TODO:: + Can GlobalBuffer be pipe + qualified? + ================= ============== ========= ================================ + +.. + + .. table:: AMDHSA Code Object V2 Kernel Code Properties Metadata Map + :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-map-table-v2 + + ============================ ============== ========= ===================== + String Key Value Type Required? Description + ============================ ============== ========= ===================== + "KernargSegmentSize" integer Required The size in bytes of + the kernarg segment + that holds the values + of the arguments to + the kernel. + "GroupSegmentFixedSize" integer Required The amount of group + segment memory + required by a + work-group in + bytes. This does not + include any + dynamically allocated + group segment memory + that may be added + when the kernel is + dispatched. + "PrivateSegmentFixedSize" integer Required The amount of fixed + private address space + memory required for a + work-item in + bytes. If the kernel + uses a dynamic call + stack then additional + space must be added + to this value for the + call stack. + "KernargSegmentAlign" integer Required The maximum byte + alignment of + arguments in the + kernarg segment. Must + be a power of 2. + "WavefrontSize" integer Required Wavefront size. Must + be a power of 2. + "NumSGPRs" integer Required Number of scalar + registers used by a + wavefront for + GFX6-GFX10. This + includes the special + SGPRs for VCC, Flat + Scratch (GFX7-GFX10) + and XNACK (for + GFX8-GFX10). It does + not include the 16 + SGPR added if a trap + handler is + enabled. It is not + rounded up to the + allocation + granularity. + "NumVGPRs" integer Required Number of vector + registers used by + each work-item for + GFX6-GFX10 + "MaxFlatWorkGroupSize" integer Required Maximum flat + work-group size + supported by the + kernel in work-items. + Must be >=1 and + consistent with + ReqdWorkGroupSize if + not 0, 0, 0. + "NumSpilledSGPRs" integer Number of stores from + a scalar register to + a register allocator + created spill + location. + "NumSpilledVGPRs" integer Number of stores from + a vector register to + a register allocator + created spill + location. + ============================ ============== ========= ===================== + +.. _amdgpu-amdhsa-code-object-metadata-v3: + +Code Object V3 Metadata (-mattr=+code-object-v3) +++++++++++++++++++++++++++++++++++++++++++++++++ + +Code object V3 metadata is specified by the ``NT_AMDGPU_METADATA`` note record +(see :ref:`amdgpu-note-records-v3`). + +The metadata is represented as Message Pack formatted binary data (see +[MsgPack]_). The top level is a Message Pack map that includes the +keys defined in table +:ref:`amdgpu-amdhsa-code-object-metadata-map-table-v3` and referenced +tables. + +Additional information can be added to the maps. To avoid conflicts, +any key names should be prefixed by "*vendor-name*." where +``vendor-name`` can be the name of the vendor and specific vendor +tool that generates the information. The prefix is abbreviated to +simply "." when it appears within a map that has been added by the +same *vendor-name*. + + .. table:: AMDHSA Code Object V3 Metadata Map + :name: amdgpu-amdhsa-code-object-metadata-map-table-v3 + + ================= ============== ========= ======================================= + String Key Value Type Required? Description + ================= ============== ========= ======================================= + "amdhsa.version" sequence of Required - The first integer is the major + 2 integers version. Currently 1. + - The second integer is the minor + version. Currently 0. + "amdhsa.printf" sequence of Each string is encoded information + strings about a printf function call. The + encoded information is organized as + fields separated by colon (':'): + + ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString`` + + where: + + ``ID`` + A 32-bit integer as a unique id for + each printf function call + + ``N`` + A 32-bit integer equal to the number + of arguments of printf function call + minus 1 + + ``S[i]`` (where i = 0, 1, ... , N-1) + 32-bit integers for the size in bytes + of the i-th FormatString argument of + the printf function call + + FormatString + The format string passed to the + printf function call. + "amdhsa.kernels" sequence of Required Sequence of the maps for each + map kernel in the code object. See + :ref:`amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3` + for the definition of the keys included + in that map. + ================= ============== ========= ======================================= + +.. + + .. table:: AMDHSA Code Object V3 Kernel Metadata Map + :name: amdgpu-amdhsa-code-object-kernel-metadata-map-table-v3 + + =================================== ============== ========= ================================ + String Key Value Type Required? Description + =================================== ============== ========= ================================ + ".name" string Required Source name of the kernel. + ".symbol" string Required Name of the kernel + descriptor ELF symbol. + ".language" string Source language of the kernel. + Values include: + + - "OpenCL C" + - "OpenCL C++" + - "HCC" + - "HIP" + - "OpenMP" + - "Assembler" + + ".language_version" sequence of - The first integer is the major + 2 integers version. + - The second integer is the + minor version. + ".args" sequence of Sequence of maps of the + map kernel arguments. See + :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3` + for the definition of the keys + included in that map. + ".reqd_workgroup_size" sequence of If not 0, 0, 0 then all values + 3 integers must be >=1 and the dispatch + work-group size X, Y, Z must + correspond to the specified + values. Defaults to 0, 0, 0. + + Corresponds to the OpenCL + ``reqd_work_group_size`` + attribute. + ".workgroup_size_hint" sequence of The dispatch work-group size + 3 integers X, Y, Z is likely to be the + specified values. + + Corresponds to the OpenCL + ``work_group_size_hint`` + attribute. + ".vec_type_hint" string The name of a scalar or vector + type. + + Corresponds to the OpenCL + ``vec_type_hint`` attribute. + + ".device_enqueue_symbol" string The external symbol name + associated with a kernel. + OpenCL runtime allocates a + global buffer for the symbol + and saves the kernel's address + to it, which is used for + device side enqueueing. Only + available for device side + enqueued kernels. + ".kernarg_segment_size" integer Required The size in bytes of + the kernarg segment + that holds the values + of the arguments to + the kernel. + ".group_segment_fixed_size" integer Required The amount of group + segment memory + required by a + work-group in + bytes. This does not + include any + dynamically allocated + group segment memory + that may be added + when the kernel is + dispatched. + ".private_segment_fixed_size" integer Required The amount of fixed + private address space + memory required for a + work-item in + bytes. If the kernel + uses a dynamic call + stack then additional + space must be added + to this value for the + call stack. + ".kernarg_segment_align" integer Required The maximum byte + alignment of + arguments in the + kernarg segment. Must + be a power of 2. + ".wavefront_size" integer Required Wavefront size. Must + be a power of 2. + ".sgpr_count" integer Required Number of scalar + registers required by a + wavefront for + GFX6-GFX9. A register + is required if it is + used explicitly, or + if a higher numbered + register is used + explicitly. This + includes the special + SGPRs for VCC, Flat + Scratch (GFX7-GFX9) + and XNACK (for + GFX8-GFX9). It does + not include the 16 + SGPR added if a trap + handler is + enabled. It is not + rounded up to the + allocation + granularity. + ".vgpr_count" integer Required Number of vector + registers required by + each work-item for + GFX6-GFX9. A register + is required if it is + used explicitly, or + if a higher numbered + register is used + explicitly. + ".max_flat_workgroup_size" integer Required Maximum flat + work-group size + supported by the + kernel in work-items. + Must be >=1 and + consistent with + ReqdWorkGroupSize if + not 0, 0, 0. + ".sgpr_spill_count" integer Number of stores from + a scalar register to + a register allocator + created spill + location. + ".vgpr_spill_count" integer Number of stores from + a vector register to + a register allocator + created spill + location. + =================================== ============== ========= ================================ + +.. + + .. table:: AMDHSA Code Object V3 Kernel Argument Metadata Map + :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3 + + ====================== ============== ========= ================================ + String Key Value Type Required? Description + ====================== ============== ========= ================================ + ".name" string Kernel argument name. + ".type_name" string Kernel argument type name. + ".size" integer Required Kernel argument size in bytes. + ".offset" integer Required Kernel argument offset in + bytes. The offset must be a + multiple of the alignment + required by the argument. + ".value_kind" string Required Kernel argument kind that + specifies how to set up the + corresponding argument. + Values include: + + "by_value" + The argument is copied + directly into the kernarg. + + "global_buffer" + A global address space pointer + to the buffer data is passed + in the kernarg. + + "dynamic_shared_pointer" + A group address space pointer + to dynamically allocated LDS + is passed in the kernarg. + + "sampler" + A global address space + pointer to a S# is passed in + the kernarg. + + "image" + A global address space + pointer to a T# is passed in + the kernarg. + + "pipe" + A global address space pointer + to an OpenCL pipe is passed in + the kernarg. + + "queue" + A global address space pointer + to an OpenCL device enqueue + queue is passed in the + kernarg. + + "hidden_global_offset_x" + The OpenCL grid dispatch + global offset for the X + dimension is passed in the + kernarg. + + "hidden_global_offset_y" + The OpenCL grid dispatch + global offset for the Y + dimension is passed in the + kernarg. + + "hidden_global_offset_z" + The OpenCL grid dispatch + global offset for the Z + dimension is passed in the + kernarg. + + "hidden_none" + An argument that is not used + by the kernel. Space needs to + be left for it, but it does + not need to be set up. + + "hidden_printf_buffer" + A global address space pointer + to the runtime printf buffer + is passed in kernarg. + + "hidden_hostcall_buffer" + A global address space pointer + to the runtime hostcall buffer + is passed in kernarg. + + "hidden_default_queue" + A global address space pointer + to the OpenCL device enqueue + queue that should be used by + the kernel by default is + passed in the kernarg. + + "hidden_completion_action" + A global address space pointer + to help link enqueued kernels into + the ancestor tree for determining + when the parent kernel has finished. + + "hidden_multigrid_sync_arg" + A global address space pointer for + multi-grid synchronization is + passed in the kernarg. + + ".value_type" string Unused and deprecated. This should no longer + be emitted, but is accepted for compatibility. + + ".pointee_align" integer Alignment in bytes of pointee + type for pointer type kernel + argument. Must be a power + of 2. Only present if + ".value_kind" is + "dynamic_shared_pointer". + ".address_space" string Kernel argument address space + qualifier. Only present if + ".value_kind" is "global_buffer" or + "dynamic_shared_pointer". Values + are: + + - "private" + - "global" + - "constant" + - "local" + - "generic" + - "region" + + .. TODO:: + Is "global_buffer" only "global" + or "constant"? Is + "dynamic_shared_pointer" always + "local"? Can HCC allow "generic"? + How can "private" or "region" + ever happen? + ".access" string Kernel argument access + qualifier. Only present if + ".value_kind" is "image" or + "pipe". Values + are: + + - "read_only" + - "write_only" + - "read_write" + + .. TODO:: + Does this apply to + "global_buffer"? + ".actual_access" string The actual memory accesses + performed by the kernel on the + kernel argument. Only present if + ".value_kind" is "global_buffer", + "image", or "pipe". This may be + more restrictive than indicated + by ".access" to reflect what the + kernel actual does. If not + present then the runtime must + assume what is implied by + ".access" and ".is_const" . Values + are: + + - "read_only" + - "write_only" + - "read_write" + + ".is_const" boolean Indicates if the kernel argument + is const qualified. Only present + if ".value_kind" is + "global_buffer". + + ".is_restrict" boolean Indicates if the kernel argument + is restrict qualified. Only + present if ".value_kind" is + "global_buffer". + + ".is_volatile" boolean Indicates if the kernel argument + is volatile qualified. Only + present if ".value_kind" is + "global_buffer". + + ".is_pipe" boolean Indicates if the kernel argument + is pipe qualified. Only present + if ".value_kind" is "pipe". + + .. TODO:: + Can "global_buffer" be pipe + qualified? + ====================== ============== ========= ================================ + +.. + +Kernel Dispatch +~~~~~~~~~~~~~~~ + +The HSA architected queuing language (AQL) defines a user space memory +interface that can be used to control the dispatch of kernels, in an agent +independent way. An agent can have zero or more AQL queues created for it using +the ROCm runtime, in which AQL packets (all of which are 64 bytes) can be +placed. See the *HSA Platform System Architecture Specification* [HSA]_ for the +AQL queue mechanics and packet layouts. + +The packet processor of a kernel agent is responsible for detecting and +dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the +packet processor is implemented by the hardware command processor (CP), +asynchronous dispatch controller (ADC) and shader processor input controller +(SPI). + +The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel +mode driver to initialize and register the AQL queue with CP. + +To dispatch a kernel the following actions are performed. This can occur in the +CPU host program, or from an HSA kernel executing on a GPU. + +1. A pointer to an AQL queue for the kernel agent on which the kernel is to be + executed is obtained. +2. A pointer to the kernel descriptor (see + :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is obtained. + It must be for a kernel that is contained in a code object that that was + loaded by the ROCm runtime on the kernel agent with which the AQL queue is + associated. +3. Space is allocated for the kernel arguments using the ROCm runtime allocator + for a memory region with the kernarg property for the kernel agent that will + execute the kernel. It must be at least 16-byte aligned. +4. Kernel argument values are assigned to the kernel argument memory + allocation. The layout is defined in the *HSA Programmer's Language + Reference* [HSA]_. For AMDGPU the kernel execution directly accesses the + kernel argument memory in the same way constant memory is accessed. (Note + that the HSA specification allows an implementation to copy the kernel + argument contents to another location that is accessed by the kernel.) +5. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime + api uses 64-bit atomic operations to reserve space in the AQL queue for the + packet. The packet must be set up, and the final write must use an atomic + store release to set the packet kind to ensure the packet contents are + visible to the kernel agent. AQL defines a doorbell signal mechanism to + notify the kernel agent that the AQL queue has been updated. These rules, and + the layout of the AQL queue and kernel dispatch packet is defined in the *HSA + System Architecture Specification* [HSA]_. +6. A kernel dispatch packet includes information about the actual dispatch, + such as grid and work-group size, together with information from the code + object about the kernel, such as segment sizes. The ROCm runtime queries on + the kernel symbol can be used to obtain the code object values which are + recorded in the :ref:`amdgpu-amdhsa-code-object-metadata`. +7. CP executes micro-code and is responsible for detecting and setting up the + GPU to execute the wavefronts of a kernel dispatch. +8. CP ensures that when the a wavefront starts executing the kernel machine + code, the scalar general purpose registers (SGPR) and vector general purpose + registers (VGPR) are set up as required by the machine code. The required + setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial + register state is defined in + :ref:`amdgpu-amdhsa-initial-kernel-execution-state`. +9. The prolog of the kernel machine code (see + :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary + before continuing executing the machine code that corresponds to the kernel. +10. When the kernel dispatch has completed execution, CP signals the completion + signal specified in the kernel dispatch packet if not 0. + +Image and Samplers +~~~~~~~~~~~~~~~~~~ + +Image and sample handles created by the ROCm runtime are 64-bit addresses of a +hardware 32-byte V# and 48 byte S# object respectively. In order to support the +HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG +enumeration values for the queries that are not trivially deducible from the S# +representation. + +HSA Signals +~~~~~~~~~~~ + +HSA signal handles created by the ROCm runtime are 64-bit addresses of a +structure allocated in memory accessible from both the CPU and GPU. The +structure is defined by the ROCm runtime and subject to change between releases +(see [AMD-ROCm-github]_). + +.. _amdgpu-amdhsa-hsa-aql-queue: + +HSA AQL Queue +~~~~~~~~~~~~~ + +The HSA AQL queue structure is defined by the ROCm runtime and subject to change +between releases (see [AMD-ROCm-github]_). For some processors it contains +fields needed to implement certain language features such as the flat address +aperture bases. It also contains fields used by CP such as managing the +allocation of scratch memory. + +.. _amdgpu-amdhsa-kernel-descriptor: + +Kernel Descriptor +~~~~~~~~~~~~~~~~~ + +A kernel descriptor consists of the information needed by CP to initiate the +execution of a kernel, including the entry point address of the machine code +that implements the kernel. + +Kernel Descriptor for GFX6-GFX10 +++++++++++++++++++++++++++++++++ + +CP microcode requires the Kernel descriptor to be allocated on 64-byte +alignment. + + .. table:: Kernel Descriptor for GFX6-GFX10 + :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table + + ======= ======= =============================== ============================ + Bits Size Field Name Description + ======= ======= =============================== ============================ + 31:0 4 bytes GROUP_SEGMENT_FIXED_SIZE The amount of fixed local + address space memory + required for a work-group + in bytes. This does not + include any dynamically + allocated local address + space memory that may be + added when the kernel is + dispatched. + 63:32 4 bytes PRIVATE_SEGMENT_FIXED_SIZE The amount of fixed + private address space + memory required for a + work-item in bytes. If + is_dynamic_callstack is 1 + then additional space must + be added to this value for + the call stack. + 127:64 8 bytes Reserved, must be 0. + 191:128 8 bytes KERNEL_CODE_ENTRY_BYTE_OFFSET Byte offset (possibly + negative) from base + address of kernel + descriptor to kernel's + entry point instruction + which must be 256 byte + aligned. + 351:272 20 Reserved, must be 0. + bytes + 383:352 4 bytes COMPUTE_PGM_RSRC3 GFX6-9 + Reserved, must be 0. + GFX10 + Compute Shader (CS) + program settings used by + CP to set up + ``COMPUTE_PGM_RSRC3`` + configuration + register. See + :ref:`amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table`. + 415:384 4 bytes COMPUTE_PGM_RSRC1 Compute Shader (CS) + program settings used by + CP to set up + ``COMPUTE_PGM_RSRC1`` + configuration + register. See + :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. + 447:416 4 bytes COMPUTE_PGM_RSRC2 Compute Shader (CS) + program settings used by + CP to set up + ``COMPUTE_PGM_RSRC2`` + configuration + register. See + :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. + 448 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the + _BUFFER SGPR user data registers + (see + :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). + + The total number of SGPR + user data registers + requested must not exceed + 16 and match value in + ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``. + Any requests beyond 16 + will be ignored. + 449 1 bit ENABLE_SGPR_DISPATCH_PTR *see above* + 450 1 bit ENABLE_SGPR_QUEUE_PTR *see above* + 451 1 bit ENABLE_SGPR_KERNARG_SEGMENT_PTR *see above* + 452 1 bit ENABLE_SGPR_DISPATCH_ID *see above* + 453 1 bit ENABLE_SGPR_FLAT_SCRATCH_INIT *see above* + 454 1 bit ENABLE_SGPR_PRIVATE_SEGMENT *see above* + _SIZE + 457:455 3 bits Reserved, must be 0. + 458 1 bit ENABLE_WAVEFRONT_SIZE32 GFX6-9 + Reserved, must be 0. + GFX10 + - If 0 execute in + wavefront size 64 mode. + - If 1 execute in + native wavefront size + 32 mode. + 463:459 5 bits Reserved, must be 0. + 511:464 6 bytes Reserved, must be 0. + 512 **Total size 64 bytes.** + ======= ==================================================================== + +.. + + .. table:: compute_pgm_rsrc1 for GFX6-GFX10 + :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table + + ======= ======= =============================== =========================================================================== + Bits Size Field Name Description + ======= ======= =============================== =========================================================================== + 5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector register + blocks used by each work-item; + granularity is device + specific: + + GFX6-GFX9 + - vgprs_used 0..256 + - max(0, ceil(vgprs_used / 4) - 1) + GFX10 (wavefront size 64) + - max_vgpr 1..256 + - max(0, ceil(vgprs_used / 4) - 1) + GFX10 (wavefront size 32) + - max_vgpr 1..256 + - max(0, ceil(vgprs_used / 8) - 1) + + Where vgprs_used is defined + as the highest VGPR number + explicitly referenced plus + one. + + Used by CP to set up + ``COMPUTE_PGM_RSRC1.VGPRS``. + + The + :ref:`amdgpu-assembler` + calculates this + automatically for the + selected processor from + values provided to the + `.amdhsa_kernel` directive + by the + `.amdhsa_next_free_vgpr` + nested directive (see + :ref:`amdhsa-kernel-directives-table`). + 9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar register + blocks used by a wavefront; + granularity is device + specific: + + GFX6-GFX8 + - sgprs_used 0..112 + - max(0, ceil(sgprs_used / 8) - 1) + GFX9 + - sgprs_used 0..112 + - 2 * max(0, ceil(sgprs_used / 16) - 1) + GFX10 + Reserved, must be 0. + (128 SGPRs always + allocated.) + + Where sgprs_used is + defined as the highest + SGPR number explicitly + referenced plus one, plus + a target specific number + of additional special + SGPRs for VCC, + FLAT_SCRATCH (GFX7+) and + XNACK_MASK (GFX8+), and + any additional + target specific + limitations. It does not + include the 16 SGPRs added + if a trap handler is + enabled. + + The target specific + limitations and special + SGPR layout are defined in + the hardware + documentation, which can + be found in the + :ref:`amdgpu-processors` + table. + + Used by CP to set up + ``COMPUTE_PGM_RSRC1.SGPRS``. + + The + :ref:`amdgpu-assembler` + calculates this + automatically for the + selected processor from + values provided to the + `.amdhsa_kernel` directive + by the + `.amdhsa_next_free_sgpr` + and `.amdhsa_reserve_*` + nested directives (see + :ref:`amdhsa-kernel-directives-table`). + 11:10 2 bits PRIORITY Must be 0. + + Start executing wavefront + at the specified priority. + + CP is responsible for + filling in + ``COMPUTE_PGM_RSRC1.PRIORITY``. + 13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution + with specified rounding + mode for single (32 + bit) floating point + precision floating point + operations. + + Floating point rounding + mode values are defined in + :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`. + + Used by CP to set up + ``COMPUTE_PGM_RSRC1.FLOAT_MODE``. + 15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution + with specified rounding + denorm mode for half/double (16 + and 64-bit) floating point + precision floating point + operations. + + Floating point rounding + mode values are defined in + :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`. + + Used by CP to set up + ``COMPUTE_PGM_RSRC1.FLOAT_MODE``. + 17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution + with specified denorm mode + for single (32 + bit) floating point + precision floating point + operations. + + Floating point denorm mode + values are defined in + :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`. + + Used by CP to set up + ``COMPUTE_PGM_RSRC1.FLOAT_MODE``. + 19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution + with specified denorm mode + for half/double (16 + and 64-bit) floating point + precision floating point + operations. + + Floating point denorm mode + values are defined in + :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`. + + Used by CP to set up + ``COMPUTE_PGM_RSRC1.FLOAT_MODE``. + 20 1 bit PRIV Must be 0. + + Start executing wavefront + in privilege trap handler + mode. + + CP is responsible for + filling in + ``COMPUTE_PGM_RSRC1.PRIV``. + 21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution + with DX10 clamp mode + enabled. Used by the vector + ALU to force DX10 style + treatment of NaN's (when + set, clamp NaN to zero, + otherwise pass NaN + through). + + Used by CP to set up + ``COMPUTE_PGM_RSRC1.DX10_CLAMP``. + 22 1 bit DEBUG_MODE Must be 0. + + Start executing wavefront + in single step mode. + + CP is responsible for + filling in + ``COMPUTE_PGM_RSRC1.DEBUG_MODE``. + 23 1 bit ENABLE_IEEE_MODE Wavefront starts execution + with IEEE mode + enabled. Floating point + opcodes that support + exception flag gathering + will quiet and propagate + signaling-NaN inputs per + IEEE 754-2008. Min_dx10 and + max_dx10 become IEEE + 754-2008 compliant due to + signaling-NaN propagation + and quieting. + + Used by CP to set up + ``COMPUTE_PGM_RSRC1.IEEE_MODE``. + 24 1 bit BULKY Must be 0. + + Only one work-group allowed + to execute on a compute + unit. + + CP is responsible for + filling in + ``COMPUTE_PGM_RSRC1.BULKY``. + 25 1 bit CDBG_USER Must be 0. + + Flag that can be used to + control debugging code. + + CP is responsible for + filling in + ``COMPUTE_PGM_RSRC1.CDBG_USER``. + 26 1 bit FP16_OVFL GFX6-GFX8 + Reserved, must be 0. + GFX9-GFX10 + Wavefront starts execution + with specified fp16 overflow + mode. + + - If 0, fp16 overflow generates + +/-INF values. + - If 1, fp16 overflow that is the + result of an +/-INF input value + or divide by 0 produces a +/-INF, + otherwise clamps computed + overflow to +/-MAX_FP16 as + appropriate. + + Used by CP to set up + ``COMPUTE_PGM_RSRC1.FP16_OVFL``. + 28:27 2 bits Reserved, must be 0. + 29 1 bit WGP_MODE GFX6-GFX9 + Reserved, must be 0. + GFX10 + - If 0 execute work-groups in + CU wavefront execution mode. + - If 1 execute work-groups on + in WGP wavefront execution mode. + + See :ref:`amdgpu-amdhsa-memory-model`. + + Used by CP to set up + ``COMPUTE_PGM_RSRC1.WGP_MODE``. + 30 1 bit MEM_ORDERED GFX6-9 + Reserved, must be 0. + GFX10 + Controls the behavior of the + waitcnt's vmcnt and vscnt + counters. + + - If 0 vmcnt reports completion + of load and atomic with return + out of order with sample + instructions, and the vscnt + reports the completion of + store and atomic without + return in order. + - If 1 vmcnt reports completion + of load, atomic with return + and sample instructions in + order, and the vscnt reports + the completion of store and + atomic without return in order. + + Used by CP to set up + ``COMPUTE_PGM_RSRC1.MEM_ORDERED``. + 31 1 bit FWD_PROGRESS GFX6-9 + Reserved, must be 0. + GFX10 + - If 0 execute SIMD wavefronts + using oldest first policy. + - If 1 execute SIMD wavefronts to + ensure wavefronts will make some + forward progress. + + Used by CP to set up + ``COMPUTE_PGM_RSRC1.FWD_PROGRESS``. + 32 **Total size 4 bytes** + ======= =================================================================================================================== + +.. + + .. table:: compute_pgm_rsrc2 for GFX6-GFX10 + :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table + + ======= ======= =============================== =========================================================================== + Bits Size Field Name Description + ======= ======= =============================== =========================================================================== + 0 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the + _WAVEFRONT_OFFSET SGPR wavefront scratch offset + system register (see + :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). + + Used by CP to set up + ``COMPUTE_PGM_RSRC2.SCRATCH_EN``. + 5:1 5 bits USER_SGPR_COUNT The total number of SGPR + user data registers + requested. This number must + match the number of user + data registers enabled. + + Used by CP to set up + ``COMPUTE_PGM_RSRC2.USER_SGPR``. + 6 1 bit ENABLE_TRAP_HANDLER Must be 0. + + This bit represents + ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``, + which is set by the CP if + the runtime has installed a + trap handler. + 7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the + system SGPR register for + the work-group id in the X + dimension (see + :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). + + Used by CP to set up + ``COMPUTE_PGM_RSRC2.TGID_X_EN``. + 8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the + system SGPR register for + the work-group id in the Y + dimension (see + :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). + + Used by CP to set up + ``COMPUTE_PGM_RSRC2.TGID_Y_EN``. + 9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the + system SGPR register for + the work-group id in the Z + dimension (see + :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). + + Used by CP to set up + ``COMPUTE_PGM_RSRC2.TGID_Z_EN``. + 10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the + system SGPR register for + work-group information (see + :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). + + Used by CP to set up + ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``. + 12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the + VGPR system registers used + for the work-item ID. + :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table` + defines the values. + + Used by CP to set up + ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``. + 13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0. + + Wavefront starts execution + with address watch + exceptions enabled which + are generated when L1 has + witnessed a thread access + an *address of + interest*. + + CP is responsible for + filling in the address + watch bit in + ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB`` + according to what the + runtime requests. + 14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0. + + Wavefront starts execution + with memory violation + exceptions exceptions + enabled which are generated + when a memory violation has + occurred for this wavefront from + L1 or LDS + (write-to-read-only-memory, + mis-aligned atomic, LDS + address out of range, + illegal address, etc.). + + CP sets the memory + violation bit in + ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB`` + according to what the + runtime requests. + 23:15 9 bits GRANULATED_LDS_SIZE Must be 0. + + CP uses the rounded value + from the dispatch packet, + not this value, as the + dispatch may contain + dynamically allocated group + segment memory. CP writes + directly to + ``COMPUTE_PGM_RSRC2.LDS_SIZE``. + + Amount of group segment + (LDS) to allocate for each + work-group. Granularity is + device specific: + + GFX6: + roundup(lds-size / (64 * 4)) + GFX7-GFX10: + roundup(lds-size / (128 * 4)) + + 24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution + _INVALID_OPERATION with specified exceptions + enabled. + + Used by CP to set up + ``COMPUTE_PGM_RSRC2.EXCP_EN`` + (set from bits 0..6). + + IEEE 754 FP Invalid + Operation + 25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more + _SOURCE input operands is a + denormal number + 26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by + _DIVISION_BY_ZERO Zero + 27 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP FP Overflow + _OVERFLOW + 28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow + _UNDERFLOW + 29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact + _INEXACT + 30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero + _ZERO (rcp_iflag_f32 instruction + only) + 31 1 bit Reserved, must be 0. + 32 **Total size 4 bytes.** + ======= =================================================================================================================== + +.. + + .. table:: compute_pgm_rsrc3 for GFX10 + :name: amdgpu-amdhsa-compute_pgm_rsrc3-gfx10-table + + ======= ======= =============================== =========================================================================== + Bits Size Field Name Description + ======= ======= =============================== =========================================================================== + 3:0 4 bits SHARED_VGPR_COUNT Number of shared VGPRs for wavefront size 64. Granularity 8. Value 0-120. + compute_pgm_rsrc1.vgprs + shared_vgpr_cnt cannot exceed 64. + 31:4 28 Reserved, must be 0. + bits + 32 **Total size 4 bytes.** + ======= =================================================================================================================== + +.. + + .. table:: Floating Point Rounding Mode Enumeration Values + :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table + + ====================================== ===== ============================== + Enumeration Name Value Description + ====================================== ===== ============================== + FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even + FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity + FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity + FLOAT_ROUND_MODE_ZERO 3 Round Toward 0 + ====================================== ===== ============================== + +.. + + .. table:: Floating Point Denorm Mode Enumeration Values + :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table + + ====================================== ===== ============================== + Enumeration Name Value Description + ====================================== ===== ============================== + FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination + Denorms + FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms + FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms + FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush + ====================================== ===== ============================== + +.. + + .. table:: System VGPR Work-Item ID Enumeration Values + :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table + + ======================================== ===== ============================ + Enumeration Name Value Description + ======================================== ===== ============================ + SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension + ID. + SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y + dimensions ID. + SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z + dimensions ID. + SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined. + ======================================== ===== ============================ + +.. _amdgpu-amdhsa-initial-kernel-execution-state: + +Initial Kernel Execution State +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +This section defines the register state that will be set up by the packet +processor prior to the start of execution of every wavefront. This is limited by +the constraints of the hardware controllers of CP/ADC/SPI. + +The order of the SGPR registers is defined, but the compiler can specify which +ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit +fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used +for enabled registers are dense starting at SGPR0: the first enabled register is +SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have +an SGPR number. + +The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to +all wavefronts of the grid. It is possible to specify more than 16 User SGPRs +using the ``enable_sgpr_*`` bit fields, in which case only the first 16 are +actually initialized. These are then immediately followed by the System SGPRs +that are set up by ADC/SPI and can have different values for each wavefront of +the grid dispatch. + +SGPR register initial state is defined in +:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`. + + .. table:: SGPR Register Set Up Order + :name: amdgpu-amdhsa-sgpr-register-set-up-order-table + + ========== ========================== ====== ============================== + SGPR Order Name Number Description + (kernel descriptor enable of + field) SGPRs + ========== ========================== ====== ============================== + First Private Segment Buffer 4 V# that can be used, together + (enable_sgpr_private with Scratch Wavefront Offset + _segment_buffer) as an offset, to access the + private address space using a + segment address. + + CP uses the value provided by + the runtime. + then Dispatch Ptr 2 64-bit address of AQL dispatch + (enable_sgpr_dispatch_ptr) packet for kernel dispatch + actually executing. + then Queue Ptr 2 64-bit address of amd_queue_t + (enable_sgpr_queue_ptr) object for AQL queue on which + the dispatch packet was + queued. + then Kernarg Segment Ptr 2 64-bit address of Kernarg + (enable_sgpr_kernarg segment. This is directly + _segment_ptr) copied from the + kernarg_address in the kernel + dispatch packet. + + Having CP load it once avoids + loading it at the beginning of + every wavefront. + then Dispatch Id 2 64-bit Dispatch ID of the + (enable_sgpr_dispatch_id) dispatch packet being + executed. + then Flat Scratch Init 2 This is 2 SGPRs: + (enable_sgpr_flat_scratch + _init) GFX6 + Not supported. + GFX7-GFX8 + The first SGPR is a 32-bit + byte offset from + ``SH_HIDDEN_PRIVATE_BASE_VIMID`` + to per SPI base of memory + for scratch for the queue + executing the kernel + dispatch. CP obtains this + from the runtime. (The + Scratch Segment Buffer base + address is + ``SH_HIDDEN_PRIVATE_BASE_VIMID`` + plus this offset.) The value + of Scratch Wavefront Offset must + be added to this offset by + the kernel machine code, + right shifted by 8, and + moved to the FLAT_SCRATCH_HI + SGPR register. + FLAT_SCRATCH_HI corresponds + to SGPRn-4 on GFX7, and + SGPRn-6 on GFX8 (where SGPRn + is the highest numbered SGPR + allocated to the wavefront). + FLAT_SCRATCH_HI is + multiplied by 256 (as it is + in units of 256 bytes) and + added to + ``SH_HIDDEN_PRIVATE_BASE_VIMID`` + to calculate the per wavefront + FLAT SCRATCH BASE in flat + memory instructions that + access the scratch + aperture. + + The second SGPR is 32-bit + byte size of a single + work-item's scratch memory + usage. CP obtains this from + the runtime, and it is + always a multiple of DWORD. + CP checks that the value in + the kernel dispatch packet + Private Segment Byte Size is + not larger and requests the + runtime to increase the + queue's scratch size if + necessary. The kernel code + must move it to + FLAT_SCRATCH_LO which is + SGPRn-3 on GFX7 and SGPRn-5 + on GFX8. FLAT_SCRATCH_LO is + used as the FLAT SCRATCH + SIZE in flat memory + instructions. Having CP load + it once avoids loading it at + the beginning of every + wavefront. + GFX9-GFX10 + This is the + 64-bit base address of the + per SPI scratch backing + memory managed by SPI for + the queue executing the + kernel dispatch. CP obtains + this from the runtime (and + divides it if there are + multiple Shader Arrays each + with its own SPI). The value + of Scratch Wavefront Offset must + be added by the kernel + machine code and the result + moved to the FLAT_SCRATCH + SGPR which is SGPRn-6 and + SGPRn-5. It is used as the + FLAT SCRATCH BASE in flat + memory instructions. + then Private Segment Size 1 The 32-bit byte size of a + (enable_sgpr_private single + work-item's + scratch_segment_size) memory + allocation. This is the + value from the kernel + dispatch packet Private + Segment Byte Size rounded up + by CP to a multiple of + DWORD. + + Having CP load it once avoids + loading it at the beginning of + every wavefront. + + This is not used for + GFX7-GFX8 since it is the same + value as the second SGPR of + Flat Scratch Init. However, it + may be needed for GFX9-GFX10 which + changes the meaning of the + Flat Scratch Init value. + then Grid Work-Group Count X 1 32-bit count of the number of + (enable_sgpr_grid work-groups in the X dimension + _workgroup_count_X) for the grid being + executed. Computed from the + fields in the kernel dispatch + packet as ((grid_size.x + + workgroup_size.x - 1) / + workgroup_size.x). + then Grid Work-Group Count Y 1 32-bit count of the number of + (enable_sgpr_grid work-groups in the Y dimension + _workgroup_count_Y && for the grid being + less than 16 previous executed. Computed from the + SGPRs) fields in the kernel dispatch + packet as ((grid_size.y + + workgroup_size.y - 1) / + workgroupSize.y). + + Only initialized if <16 + previous SGPRs initialized. + then Grid Work-Group Count Z 1 32-bit count of the number of + (enable_sgpr_grid work-groups in the Z dimension + _workgroup_count_Z && for the grid being + less than 16 previous executed. Computed from the + SGPRs) fields in the kernel dispatch + packet as ((grid_size.z + + workgroup_size.z - 1) / + workgroupSize.z). + + Only initialized if <16 + previous SGPRs initialized. + then Work-Group Id X 1 32-bit work-group id in X + (enable_sgpr_workgroup_id dimension of grid for + _X) wavefront. + then Work-Group Id Y 1 32-bit work-group id in Y + (enable_sgpr_workgroup_id dimension of grid for + _Y) wavefront. + then Work-Group Id Z 1 32-bit work-group id in Z + (enable_sgpr_workgroup_id dimension of grid for + _Z) wavefront. + then Work-Group Info 1 {first_wavefront, 14'b0000, + (enable_sgpr_workgroup ordered_append_term[10:0], + _info) threadgroup_size_in_wavefronts[5:0]} + then Scratch Wavefront Offset 1 32-bit byte offset from base + (enable_sgpr_private of scratch base of queue + _segment_wavefront_offset) executing the kernel + dispatch. Must be used as an + offset with Private + segment address when using + Scratch Segment Buffer. It + must be used to set up FLAT + SCRATCH for flat addressing + (see + :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`). + ========== ========================== ====== ============================== + +The order of the VGPR registers is defined, but the compiler can specify which +ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit +fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used +for enabled registers are dense starting at VGPR0: the first enabled register is +VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a +VGPR number. + +VGPR register initial state is defined in +:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`. + + .. table:: VGPR Register Set Up Order + :name: amdgpu-amdhsa-vgpr-register-set-up-order-table + + ========== ========================== ====== ============================== + VGPR Order Name Number Description + (kernel descriptor enable of + field) VGPRs + ========== ========================== ====== ============================== + First Work-Item Id X 1 32-bit work item id in X + (Always initialized) dimension of work-group for + wavefront lane. + then Work-Item Id Y 1 32-bit work item id in Y + (enable_vgpr_workitem_id dimension of work-group for + > 0) wavefront lane. + then Work-Item Id Z 1 32-bit work item id in Z + (enable_vgpr_workitem_id dimension of work-group for + > 1) wavefront lane. + ========== ========================== ====== ============================== + +The setting of registers is done by GPU CP/ADC/SPI hardware as follows: + +1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data + registers. +2. Work-group Id registers X, Y, Z are set by ADC which supports any + combination including none. +3. Scratch Wavefront Offset is set by SPI in a per wavefront basis which is why + its value cannot be included with the flat scratch init value which is per + queue. +4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y) + or (X, Y, Z). + +Flat Scratch register pair are adjacent SGPRs so they can be moved as a 64-bit +value to the hardware required SGPRn-3 and SGPRn-4 respectively. + +The global segment can be accessed either using buffer instructions (GFX6 which +has V# 64-bit address support), flat instructions (GFX7-GFX10), or global +instructions (GFX9-GFX10). + +If buffer operations are used, then the compiler can generate a V# with the +following properties: + +* base address of 0 +* no swizzle +* ATC: 1 if IOMMU present (such as APU) +* ptr64: 1 +* MTYPE set to support memory coherence that matches the runtime (such as CC for + APU and NC for dGPU). + +.. _amdgpu-amdhsa-kernel-prolog: + +Kernel Prolog +~~~~~~~~~~~~~ + +The compiler performs initialization in the kernel prologue depending on the +target and information about things like stack usage in the kernel and called +functions. Some of this initialization requires the compiler to request certain +User and System SGPRs be present in the +:ref:`amdgpu-amdhsa-initial-kernel-execution-state` via the +:ref:`amdgpu-amdhsa-kernel-descriptor`. + +.. _amdgpu-amdhsa-kernel-prolog-cfi: + +CFI ++++ + +1. The CFI return address is undefined. + +2. The CFI CFA is defined using an expression which evaluates to a location + description that comprises one memory location description for the + ``DW_ASPACE_AMDGPU_private_lane`` address space address ``0``. + +.. _amdgpu-amdhsa-kernel-prolog-m0: + +M0 +++ + +GFX6-GFX8 + The M0 register must be initialized with a value at least the total LDS size + if the kernel may access LDS via DS or flat operations. Total LDS size is + available in dispatch packet. For M0, it is also possible to use maximum + possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for + GFX7-GFX8). +GFX9-GFX10 + The M0 register is not used for range checking LDS accesses and so does not + need to be initialized in the prolog. + +.. _amdgpu-amdhsa-kernel-prolog-stack-pointer: + +Stack Pointer ++++++++++++++ + +If the kernel has function calls it must set up the ABI stack pointer described +in :ref:`amdgpu-amdhsa-function-call-convention-non-kernel-functions` by setting +SGPR32 to the unswizzled scratch offset of the address past the last local +allocation. + +.. _amdgpu-amdhsa-kernel-prolog-frame-pointer: + +Frame Pointer ++++++++++++++ + +If the kernel needs a frame pointer for the reasons defined in +``SIFrameLowering`` then SGPR33 is used and is always set to ``0`` in the +kernel prolog. If a frame pointer is not required then all uses of the frame +pointer are replaced with immediate ``0`` offsets. + +.. _amdgpu-amdhsa-kernel-prolog-flat-scratch: + +Flat Scratch +++++++++++++ + +If the kernel or any function it calls may use flat operations to access +scratch memory, the prolog code must set up the FLAT_SCRATCH register pair +(FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which are in SGPRn-4/SGPRn-3). Initialization +uses Flat Scratch Init and Scratch Wavefront Offset SGPR registers (see +:ref:`amdgpu-amdhsa-initial-kernel-execution-state`): + +GFX6 + Flat scratch is not supported. + +GFX7-GFX8 + + 1. The low word of Flat Scratch Init is 32-bit byte offset from + ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory + being managed by SPI for the queue executing the kernel dispatch. This is + the same value used in the Scratch Segment Buffer V# base address. The + prolog must add the value of Scratch Wavefront Offset to get the + wavefront's byte scratch backing memory offset from + ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since FLAT_SCRATCH_LO is in units of 256 + bytes, the offset must be right shifted by 8 before moving into + FLAT_SCRATCH_LO. + 2. The second word of Flat Scratch Init is 32-bit byte size of a single + work-items scratch memory usage. This is directly loaded from the kernel + dispatch packet Private Segment Byte Size and rounded up to a multiple of + DWORD. Having CP load it once avoids loading it at the beginning of every + wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT + SCRATCH SIZE. + +GFX9-GFX10 + The Flat Scratch Init is the 64-bit address of the base of scratch backing + memory being managed by SPI for the queue executing the kernel dispatch. The + prolog must add the value of Scratch Wavefront Offset and moved to the + FLAT_SCRATCH pair for use as the flat scratch base in flat memory + instructions. + +.. _amdgpu-amdhsa-kernel-prolog-private-segment-buffer: + +Private Segment Buffer +++++++++++++++++++++++ + +A set of four SGPRs beginning at a four-aligned SGPR index are always selected +to serve as the scratch V# for the kernel as follows: + + - If it is known during instruction selection that there is stack usage, + SGPR0-3 is reserved for use as the scratch V#. Stack usage is assumed if + optimizations are disabled (``-O0``), if stack objects already exist (for + locals, etc.), or if there are any function calls. + + - Otherwise, four high numbered SGPRs beginning at a four-aligned SGPR index + are reserved for the tentative scratch V#. These will be used if it is + determined that spilling is needed. + + - If no use is made of the tentative scratch V#, then it is unreserved, + and the register count is determined ignoring it. + - If use is made of the tentative scratch V#, then its register numbers + are shifted to the first four-aligned SGPR index after the highest one + allocated by the register allocator, and all uses are updated. The + register count includes them in the shifted location. + - In either case, if the processor has the SGPR allocation bug, the + tentative allocation is not shifted or unreserved in order to ensure + the register count is higher to workaround the bug. + + .. note:: + + This approach of using a tentative scratch V# and shifting the register + numbers if used avoids having to perform register allocation a second + time if the tentative V# is eliminated. This is more efficient and + avoids the problem that the second register allocation may perform + spilling which will fail as there is no longer a scratch V#. + +When the kernel prolog code is being emitted it is known whether the scratch V# +described above is actually used. If it is, the prolog code must set it up by +copying the Private Segment Buffer to the scratch V# registers and then adding +the Private Segment Wavefront Offset to the queue base address in the V#. The +result is a V# with a base address pointing to the beginning of the wavefront +scratch backing memory. + +The Private Segment Buffer is always requested, but the Private Segment +Wavefront Offset is only requested if it is used (see +:ref:`amdgpu-amdhsa-initial-kernel-execution-state`). + +.. _amdgpu-amdhsa-memory-model: + +Memory Model +~~~~~~~~~~~~ + +This section describes the mapping of LLVM memory model onto AMDGPU machine code +(see :ref:`memmodel`). + +The AMDGPU backend supports the memory synchronization scopes specified in +:ref:`amdgpu-memory-scopes`. + +The code sequences used to implement the memory model are defined in table +:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx10-table`. + +The sequences specify the order of instructions that a single thread must +execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect +to other memory instructions executed by the same thread. This allows them to be +moved earlier or later which can allow them to be combined with other instances +of the same instruction, or hoisted/sunk out of loops to improve +performance. Only the instructions related to the memory model are given; +additional ``s_waitcnt`` instructions are required to ensure registers are +defined before being used. These may be able to be combined with the memory +model ``s_waitcnt`` instructions as described above. + +The AMDGPU backend supports the following memory models: + + HSA Memory Model [HSA]_ + The HSA memory model uses a single happens-before relation for all address + spaces (see :ref:`amdgpu-address-spaces`). + OpenCL Memory Model [OpenCL]_ + The OpenCL memory model which has separate happens-before relations for the + global and local address spaces. Only a fence specifying both global and + local address space, and seq_cst instructions join the relationships. Since + the LLVM ``memfence`` instruction does not allow an address space to be + specified the OpenCL fence has to conservatively assume both local and + global address space was specified. However, optimizations can often be + done to eliminate the additional ``s_waitcnt`` instructions when there are + no intervening memory instructions which access the corresponding address + space. The code sequences in the table indicate what can be omitted for the + OpenCL memory. The target triple environment is used to determine if the + source language is OpenCL (see :ref:`amdgpu-opencl`). + +``ds/flat_load/store/atomic`` instructions to local memory are termed LDS +operations. + +``buffer/global/flat_load/store/atomic`` instructions to global memory are +termed vector memory operations. + +For GFX6-GFX9: + +* Each agent has multiple shader arrays (SA). +* Each SA has multiple compute units (CU). +* Each CU has multiple SIMDs that execute wavefronts. +* The wavefronts for a single work-group are executed in the same CU but may be + executed by different SIMDs. +* Each CU has a single LDS memory shared by the wavefronts of the work-groups + executing on it. +* All LDS operations of a CU are performed as wavefront wide operations in a + global order and involve no caching. Completion is reported to a wavefront in + execution order. +* The LDS memory has multiple request queues shared by the SIMDs of a + CU. Therefore, the LDS operations performed by different wavefronts of a + work-group can be reordered relative to each other, which can result in + reordering the visibility of vector memory operations with respect to LDS + operations of other wavefronts in the same work-group. A ``s_waitcnt + lgkmcnt(0)`` is required to ensure synchronization between LDS operations and + vector memory operations between wavefronts of a work-group, but not between + operations performed by the same wavefront. +* The vector memory operations are performed as wavefront wide operations and + completion is reported to a wavefront in execution order. The exception is + that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of + vector memory order if they access LDS memory, and out of LDS operation order + if they access global memory. +* The vector memory operations access a single vector L1 cache shared by all + SIMDs a CU. Therefore, no special action is required for coherence between the + lanes of a single wavefront, or for coherence between wavefronts in the same + work-group. A ``buffer_wbinvl1_vol`` is required for coherence between + wavefronts executing in different work-groups as they may be executing on + different CUs. +* The scalar memory operations access a scalar L1 cache shared by all wavefronts + on a group of CUs. The scalar and vector L1 caches are not coherent. However, + scalar operations are used in a restricted way so do not impact the memory + model. See :ref:`amdgpu-address-spaces`. +* The vector and scalar memory operations use an L2 cache shared by all CUs on + the same agent. +* The L2 cache has independent channels to service disjoint ranges of virtual + addresses. +* Each CU has a separate request queue per channel. Therefore, the vector and + scalar memory operations performed by wavefronts executing in different + work-groups (which may be executing on different CUs) of an agent can be + reordered relative to each other. A ``s_waitcnt vmcnt(0)`` is required to + ensure synchronization between vector memory operations of different CUs. It + ensures a previous vector memory operation has completed before executing a + subsequent vector memory or LDS operation and so can be used to meet the + requirements of acquire and release. +* The L2 cache can be kept coherent with other agents on some targets, or ranges + of virtual addresses can be set up to bypass it to ensure system coherence. + +For GFX10: + +* Each agent has multiple shader arrays (SA). +* Each SA has multiple work-group processors (WGP). +* Each WGP has multiple compute units (CU). +* Each CU has multiple SIMDs that execute wavefronts. +* The wavefronts for a single work-group are executed in the same + WGP. In CU wavefront execution mode the wavefronts may be executed by + different SIMDs in the same CU. In WGP wavefront execution mode the + wavefronts may be executed by different SIMDs in different CUs in the same + WGP. +* Each WGP has a single LDS memory shared by the wavefronts of the work-groups + executing on it. +* All LDS operations of a WGP are performed as wavefront wide operations in a + global order and involve no caching. Completion is reported to a wavefront in + execution order. +* The LDS memory has multiple request queues shared by the SIMDs of a + WGP. Therefore, the LDS operations performed by different wavefronts of a + work-group can be reordered relative to each other, which can result in + reordering the visibility of vector memory operations with respect to LDS + operations of other wavefronts in the same work-group. A ``s_waitcnt + lgkmcnt(0)`` is required to ensure synchronization between LDS operations and + vector memory operations between wavefronts of a work-group, but not between + operations performed by the same wavefront. +* The vector memory operations are performed as wavefront wide operations. + Completion of load/store/sample operations are reported to a wavefront in + execution order of other load/store/sample operations performed by that + wavefront. +* The vector memory operations access a vector L0 cache. There is a single L0 + cache per CU. Each SIMD of a CU accesses the same L0 cache. Therefore, no + special action is required for coherence between the lanes of a single + wavefront. However, a ``BUFFER_GL0_INV`` is required for coherence between + wavefronts executing in the same work-group as they may be executing on SIMDs + of different CUs that access different L0s. A ``BUFFER_GL0_INV`` is also + required for coherence between wavefronts executing in different work-groups + as they may be executing on different WGPs. +* The scalar memory operations access a scalar L0 cache shared by all wavefronts + on a WGP. The scalar and vector L0 caches are not coherent. However, scalar + operations are used in a restricted way so do not impact the memory model. See + :ref:`amdgpu-address-spaces`. +* The vector and scalar memory L0 caches use an L1 cache shared by all WGPs on + the same SA. Therefore, no special action is required for coherence between + the wavefronts of a single work-group. However, a ``BUFFER_GL1_INV`` is + required for coherence between wavefronts executing in different work-groups + as they may be executing on different SAs that access different L1s. +* The L1 caches have independent quadrants to service disjoint ranges of virtual + addresses. +* Each L0 cache has a separate request queue per L1 quadrant. Therefore, the + vector and scalar memory operations performed by different wavefronts, whether + executing in the same or different work-groups (which may be executing on + different CUs accessing different L0s), can be reordered relative to each + other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is required to ensure + synchronization between vector memory operations of different wavefronts. It + ensures a previous vector memory operation has completed before executing a + subsequent vector memory or LDS operation and so can be used to meet the + requirements of acquire, release and sequential consistency. +* The L1 caches use an L2 cache shared by all SAs on the same agent. +* The L2 cache has independent channels to service disjoint ranges of virtual + addresses. +* Each L1 quadrant of a single SA accesses a different L2 channel. Each L1 + quadrant has a separate request queue per L2 channel. Therefore, the vector + and scalar memory operations performed by wavefronts executing in different + work-groups (which may be executing on different SAs) of an agent can be + reordered relative to each other. A ``s_waitcnt vmcnt(0) & vscnt(0)`` is + required to ensure synchronization between vector memory operations of + different SAs. It ensures a previous vector memory operation has completed + before executing a subsequent vector memory and so can be used to meet the + requirements of acquire, release and sequential consistency. +* The L2 cache can be kept coherent with other agents on some targets, or ranges + of virtual addresses can be set up to bypass it to ensure system coherence. + +Private address space uses ``buffer_load/store`` using the scratch V# +(GFX6-GFX8), or ``scratch_load/store`` (GFX9-GFX10). Since only a single thread +is accessing the memory, atomic memory orderings are not meaningful, and all +accesses are treated as non-atomic. + +Constant address space uses ``buffer/global_load`` instructions (or equivalent +scalar memory instructions). Since the constant address space contents do not +change during the execution of a kernel dispatch it is not legal to perform +stores, and atomic memory orderings are not meaningful, and all access are +treated as non-atomic. + +A memory synchronization scope wider than work-group is not meaningful for the +group (LDS) address space and is treated as work-group. + +The memory model does not support the region address space which is treated as +non-atomic. + +Acquire memory ordering is not meaningful on store atomic instructions and is +treated as non-atomic. + +Release memory ordering is not meaningful on load atomic instructions and is +treated a non-atomic. + +Acquire-release memory ordering is not meaningful on load or store atomic +instructions and is treated as acquire and release respectively. + +AMDGPU backend only uses scalar memory operations to access memory that is +proven to not change during the execution of the kernel dispatch. This includes +constant address space and global address space for program scope const +variables. Therefore, the kernel machine code does not have to maintain the +scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar +and vector L1 caches are invalidated between kernel dispatches by CP since +constant address space data may change between kernel dispatch executions. See +:ref:`amdgpu-address-spaces`. + +The one exception is if scalar writes are used to spill SGPR registers. In this +case the AMDGPU backend ensures the memory location used to spill is never +accessed by vector memory operations at the same time. If scalar writes are used +then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function +return since the locations may be used for vector memory instructions by a +future wavefront that uses the same scratch area, or a function call that +creates a frame at the same address, respectively. There is no need for a +``s_dcache_inv`` as all scalar writes are write-before-read in the same thread. + +For GFX6-GFX9, scratch backing memory (which is used for the private address +space) is accessed with MTYPE NC_NV (non-coherent non-volatile). Since the +private address space is only accessed by a single thread, and is always +write-before-read, there is never a need to invalidate these entries from the L1 +cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the +volatile cache lines. + +For GFX10, scratch backing memory (which is used for the private address space) +is accessed with MTYPE NC (non-coherent). Since the private address space is +only accessed by a single thread, and is always write-before-read, there is +never a need to invalidate these entries from the L0 or L1 caches. + +For GFX10, wavefronts are executed in native mode with in-order reporting of +loads and sample instructions. In this mode vmcnt reports completion of load, +atomic with return and sample instructions in order, and the vscnt reports the +completion of store and atomic without return in order. See ``MEM_ORDERED`` +field in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. + +In GFX10, wavefronts can be executed in WGP or CU wavefront execution mode: + +* In WGP wavefront execution mode the wavefronts of a work-group are executed + on the SIMDs of both CUs of the WGP. Therefore, explicit management of the per + CU L0 caches is required for work-group synchronization. Also accesses to L1 + at work-group scope need to be explicitly ordered as the accesses from + different CUs are not ordered. +* In CU wavefront execution mode the wavefronts of a work-group are executed on + the SIMDs of a single CU of the WGP. Therefore, all global memory access by + the work-group access the same L0 which in turn ensures L1 accesses are + ordered and so do not require explicit management of the caches for + work-group synchronization. + +See ``WGP_MODE`` field in +:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table` and +:ref:`amdgpu-target-features`. + +On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing +to invalidate the L2 cache. For GFX6-GFX9, this also causes it to be treated as +non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC +(cache coherent) and so the L2 cache will be coherent with the CPU and other +agents. + + .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX10 + :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx10-table + + ============ ============ ============== ========== =============================== ================================== + LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code AMDGPU Machine Code + Ordering Sync Scope Address GFX6-9 GFX10 + Space + ============ ============ ============== ========== =============================== ================================== + **Non-Atomic** + ---------------------------------------------------------------------------------------------------------------------- + load *none* *none* - global - !volatile & !nontemporal - !volatile & !nontemporal + - generic + - private 1. buffer/global/flat_load 1. buffer/global/flat_load + - constant + - volatile & !nontemporal - volatile & !nontemporal + + 1. buffer/global/flat_load 1. buffer/global/flat_load + glc=1 glc=1 dlc=1 + + - nontemporal - nontemporal + + 1. buffer/global/flat_load 1. buffer/global/flat_load + glc=1 slc=1 slc=1 + + load *none* *none* - local 1. ds_load 1. ds_load + store *none* *none* - global - !nontemporal - !nontemporal + - generic + - private 1. buffer/global/flat_store 1. buffer/global/flat_store + - constant + - nontemporal - nontemporal + + 1. buffer/global/flat_store 1. buffer/global/flat_store + glc=1 slc=1 slc=1 + + store *none* *none* - local 1. ds_store 1. ds_store + **Unordered Atomic** + ---------------------------------------------------------------------------------------------------------------------- + load atomic unordered *any* *any* *Same as non-atomic*. *Same as non-atomic*. + store atomic unordered *any* *any* *Same as non-atomic*. *Same as non-atomic*. + atomicrmw unordered *any* *any* *Same as monotonic *Same as monotonic + atomic*. atomic*. + **Monotonic Atomic** + ---------------------------------------------------------------------------------------------------------------------- + load atomic monotonic - singlethread - global 1. buffer/global/flat_load 1. buffer/global/flat_load + - wavefront - generic + load atomic monotonic - workgroup - global 1. buffer/global/flat_load 1. buffer/global/flat_load + - generic glc=1 + + - If CU wavefront execution mode, omit glc=1. + + load atomic monotonic - singlethread - local 1. ds_load 1. ds_load + - wavefront + - workgroup + load atomic monotonic - agent - global 1. buffer/global/flat_load 1. buffer/global/flat_load + - system - generic glc=1 glc=1 dlc=1 + store atomic monotonic - singlethread - global 1. buffer/global/flat_store 1. buffer/global/flat_store + - wavefront - generic + - workgroup + - agent + - system + store atomic monotonic - singlethread - local 1. ds_store 1. ds_store + - wavefront + - workgroup + atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic 1. buffer/global/flat_atomic + - wavefront - generic + - workgroup + - agent + - system + atomicrmw monotonic - singlethread - local 1. ds_atomic 1. ds_atomic + - wavefront + - workgroup + **Acquire Atomic** + ---------------------------------------------------------------------------------------------------------------------- + load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load 1. buffer/global/ds/flat_load + - wavefront - local + - generic + load atomic acquire - workgroup - global 1. buffer/global/flat_load 1. buffer/global_load glc=1 + + - If CU wavefront execution mode, omit glc=1. + + 2. s_waitcnt vmcnt(0) + + - If CU wavefront execution mode, omit. + - Must happen before + the following buffer_gl0_inv + and before any following + global/generic + load/load + atomic/store/store + atomic/atomicrmw. + + 3. buffer_gl0_inv + + - If CU wavefront execution mode, omit. + - Ensures that + following + loads will not see + stale data. + + load atomic acquire - workgroup - local 1. ds_load 1. ds_load + 2. s_waitcnt lgkmcnt(0) 2. s_waitcnt lgkmcnt(0) + + - If OpenCL, omit. - If OpenCL, omit. + - Must happen before - Must happen before + any following the following buffer_gl0_inv + global/generic and before any following + load/load global/generic load/load + atomic/store/store atomic/store/store + atomic/atomicrmw. atomic/atomicrmw. + - Ensures any - Ensures any + following global following global + data read is no data read is no + older than the load older than the load + atomic value being atomic value being + acquired. acquired. + + 3. buffer_gl0_inv + + - If CU wavefront execution mode, omit. + - If OpenCL, omit. + - Ensures that + following + loads will not see + stale data. + + load atomic acquire - workgroup - generic 1. flat_load 1. flat_load glc=1 + + - If CU wavefront execution mode, omit glc=1. + + 2. s_waitcnt lgkmcnt(0) 2. s_waitcnt lgkmcnt(0) & + vmcnt(0) + + - If CU wavefront execution mode, omit vmcnt. + - If OpenCL, omit. - If OpenCL, omit + lgkmcnt(0). + - Must happen before - Must happen before + any following the following + global/generic buffer_gl0_inv and any + load/load following global/generic + atomic/store/store load/load + atomic/atomicrmw. atomic/store/store + atomic/atomicrmw. + - Ensures any - Ensures any + following global following global + data read is no data read is no + older than the load older than the load + atomic value being atomic value being + acquired. acquired. + + 3. buffer_gl0_inv + + - If CU wavefront execution mode, omit. + - Ensures that + following + loads will not see + stale data. + + load atomic acquire - agent - global 1. buffer/global/flat_load 1. buffer/global_load + - system glc=1 glc=1 dlc=1 + 2. s_waitcnt vmcnt(0) 2. s_waitcnt vmcnt(0) + + - Must happen before - Must happen before + following following + buffer_wbinvl1_vol. buffer_gl*_inv. + - Ensures the load - Ensures the load + has completed has completed + before invalidating before invalidating + the cache. the caches. + + 3. buffer_wbinvl1_vol 3. buffer_gl0_inv; + buffer_gl1_inv + + - Must happen before - Must happen before + any following any following + global/generic global/generic + load/load load/load + atomic/atomicrmw. atomic/atomicrmw. + - Ensures that - Ensures that + following following + loads will not see loads will not see + stale global data. stale global data. + + load atomic acquire - agent - generic 1. flat_load glc=1 1. flat_load glc=1 dlc=1 + - system 2. s_waitcnt vmcnt(0) & 2. s_waitcnt vmcnt(0) & + lgkmcnt(0) lgkmcnt(0) + + - If OpenCL omit - If OpenCL omit + lgkmcnt(0). lgkmcnt(0). + - Must happen before - Must happen before + following following + buffer_wbinvl1_vol. buffer_gl*_invl. + - Ensures the flat_load - Ensures the flat_load + has completed has completed + before invalidating before invalidating + the cache. the caches. + + 3. buffer_wbinvl1_vol 3. buffer_gl0_inv; + buffer_gl1_inv + + - Must happen before - Must happen before + any following any following + global/generic global/generic + load/load load/load + atomic/atomicrmw. atomic/atomicrmw. + - Ensures that - Ensures that + following loads following loads + will not see stale will not see stale + global data. global data. + + atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic 1. buffer/global/ds/flat_atomic + - wavefront - local + - generic + atomicrmw acquire - workgroup - global 1. buffer/global/flat_atomic 1. buffer/global_atomic + 2. s_waitcnt vm/vscnt(0) + + - If CU wavefront execution mode, omit. + - Use vmcnt if atomic with + return and vscnt if atomic + with no-return. + - Must happen before + the following buffer_gl0_inv + and before any following + global/generic + load/load + atomic/store/store + atomic/atomicrmw. + + 3. buffer_gl0_inv + + - If CU wavefront execution mode, omit. + - Ensures that + following + loads will not see + stale data. + + atomicrmw acquire - workgroup - local 1. ds_atomic 1. ds_atomic + 2. waitcnt lgkmcnt(0) 2. waitcnt lgkmcnt(0) + + - If OpenCL, omit. - If OpenCL, omit. + - Must happen before - Must happen before + any following the following + global/generic buffer_gl0_inv. + load/load + atomic/store/store + atomic/atomicrmw. + - Ensures any - Ensures any + following global following global + data read is no data read is no + older than the older than the + atomicrmw value atomicrmw value + being acquired. being acquired. + + 3. buffer_gl0_inv + + - If OpenCL omit. + - Ensures that + following + loads will not see + stale data. + + atomicrmw acquire - workgroup - generic 1. flat_atomic 1. flat_atomic + 2. waitcnt lgkmcnt(0) 2. waitcnt lgkmcnt(0) & + vm/vscnt(0) + + - If CU wavefront execution mode, omit vm/vscnt. + - If OpenCL, omit. - If OpenCL, omit + waitcnt lgkmcnt(0).. + - Use vmcnt if atomic with + return and vscnt if atomic + with no-return. + waitcnt lgkmcnt(0). + - Must happen before - Must happen before + any following the following + global/generic buffer_gl0_inv. + load/load + atomic/store/store + atomic/atomicrmw. + - Ensures any - Ensures any + following global following global + data read is no data read is no + older than the older than the + atomicrmw value atomicrmw value + being acquired. being acquired. + + 3. buffer_gl0_inv + + - If CU wavefront execution mode, omit. + - Ensures that + following + loads will not see + stale data. + + atomicrmw acquire - agent - global 1. buffer/global/flat_atomic 1. buffer/global_atomic + - system 2. s_waitcnt vmcnt(0) 2. s_waitcnt vm/vscnt(0) + + - Use vmcnt if atomic with + return and vscnt if atomic + with no-return. + waitcnt lgkmcnt(0). + - Must happen before - Must happen before + following following + buffer_wbinvl1_vol. buffer_gl*_inv. + - Ensures the - Ensures the + atomicrmw has atomicrmw has + completed before completed before + invalidating the invalidating the + cache. caches. + + 3. buffer_wbinvl1_vol 3. buffer_gl0_inv; + buffer_gl1_inv + + - Must happen before - Must happen before + any following any following + global/generic global/generic + load/load load/load + atomic/atomicrmw. atomic/atomicrmw. + - Ensures that - Ensures that + following loads following loads + will not see stale will not see stale + global data. global data. + + atomicrmw acquire - agent - generic 1. flat_atomic 1. flat_atomic + - system 2. s_waitcnt vmcnt(0) & 2. s_waitcnt vm/vscnt(0) & + lgkmcnt(0) lgkmcnt(0) + + - If OpenCL, omit - If OpenCL, omit + lgkmcnt(0). lgkmcnt(0). + - Use vmcnt if atomic with + return and vscnt if atomic + with no-return. + - Must happen before - Must happen before + following following + buffer_wbinvl1_vol. buffer_gl*_inv. + - Ensures the - Ensures the + atomicrmw has atomicrmw has + completed before completed before + invalidating the invalidating the + cache. caches. + + 3. buffer_wbinvl1_vol 3. buffer_gl0_inv; + buffer_gl1_inv + + - Must happen before - Must happen before + any following any following + global/generic global/generic + load/load load/load + atomic/atomicrmw. atomic/atomicrmw. + - Ensures that - Ensures that + following loads following loads + will not see stale will not see stale + global data. global data. + + fence acquire - singlethread *none* *none* *none* + - wavefront + fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & + vmcnt(0) & vscnt(0) + + - If CU wavefront execution mode, omit vmcnt and + vscnt. + - If OpenCL and - If OpenCL and + address space is address space is + not generic, omit. not generic, omit + lgkmcnt(0). + - If OpenCL and + address space is + local, omit + vmcnt(0) and vscnt(0). + - However, since LLVM - However, since LLVM + currently has no currently has no + address space on address space on + the fence need to the fence need to + conservatively conservatively + always generate. If always generate. If + fence had an fence had an + address space then address space then + set to address set to address + space of OpenCL space of OpenCL + fence flag, or to fence flag, or to + generic if both generic if both + local and global local and global + flags are flags are + specified. specified. + - Must happen after + any preceding + local/generic load + atomic/atomicrmw + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + fence-paired-atomic). + - Must happen before + any following + global/generic + load/load + atomic/store/store + atomic/atomicrmw. + - Ensures any + following global + data read is no + older than the + value read by the + fence-paired-atomic. + - Could be split into + separate s_waitcnt + vmcnt(0), s_waitcnt + vscnt(0) and s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic load + atomic/ + atomicrmw-with-return-value + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + fence-paired-atomic). + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + atomicrmw-no-return-value + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + fence-paired-atomic). + - s_waitcnt lgkmcnt(0) + must happen after + any preceding + local/generic load + atomic/atomicrmw + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + fence-paired-atomic). + - Must happen before + the following + buffer_gl0_inv. + - Ensures that the + fence-paired atomic + has completed + before invalidating + the + cache. Therefore + any following + locations read must + be no older than + the value read by + the + fence-paired-atomic. + + 3. buffer_gl0_inv + + - If CU wavefront execution mode, omit. + - Ensures that + following + loads will not see + stale data. + + fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) & + - system vmcnt(0) vmcnt(0) & vscnt(0) + + - If OpenCL and - If OpenCL and + address space is address space is + not generic, omit not generic, omit + lgkmcnt(0). lgkmcnt(0). + - If OpenCL and + address space is + local, omit + vmcnt(0) and vscnt(0). + - However, since LLVM - However, since LLVM + currently has no currently has no + address space on address space on + the fence need to the fence need to + conservatively conservatively + always generate always generate + (see comment for (see comment for + previous fence). previous fence). + - Could be split into + separate s_waitcnt + vmcnt(0) and + s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic load + atomic/atomicrmw + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + fence-paired-atomic). + - s_waitcnt lgkmcnt(0) + must happen after + any preceding + local/generic load + atomic/atomicrmw + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + fence-paired-atomic). + - Must happen before + the following + buffer_wbinvl1_vol. + - Ensures that the + fence-paired atomic + has completed + before invalidating + the + cache. Therefore + any following + locations read must + be no older than + the value read by + the + fence-paired-atomic. + - Could be split into + separate s_waitcnt + vmcnt(0), s_waitcnt + vscnt(0) and s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic load + atomic/ + atomicrmw-with-return-value + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + fence-paired-atomic). + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + atomicrmw-no-return-value + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + fence-paired-atomic). + - s_waitcnt lgkmcnt(0) + must happen after + any preceding + local/generic load + atomic/atomicrmw + with an equal or + wider sync scope + and memory ordering + stronger than + unordered (this is + termed the + fence-paired-atomic). + - Must happen before + the following + buffer_gl*_inv. + - Ensures that the + fence-paired atomic + has completed + before invalidating + the + caches. Therefore + any following + locations read must + be no older than + the value read by + the + fence-paired-atomic. + + 2. buffer_wbinvl1_vol 2. buffer_gl0_inv; + buffer_gl1_inv + + - Must happen before any - Must happen before any + following global/generic following global/generic + load/load load/load + atomic/store/store atomic/store/store + atomic/atomicrmw. atomic/atomicrmw. + - Ensures that - Ensures that + following loads following loads + will not see stale will not see stale + global data. global data. + + **Release Atomic** + ---------------------------------------------------------------------------------------------------------------------- + store atomic release - singlethread - global 1. buffer/global/ds/flat_store 1. buffer/global/ds/flat_store + - wavefront - local + - generic + store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & + vmcnt(0) & vscnt(0) + + - If CU wavefront execution mode, omit vmcnt and + vscnt. + - If OpenCL, omit. - If OpenCL, omit + lgkmcnt(0). + - Must happen after + any preceding + local/generic + load/store/load + atomic/store + atomic/atomicrmw. + - Could be split into + separate s_waitcnt + vmcnt(0), s_waitcnt + vscnt(0) and s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic load/load + atomic/ + atomicrmw-with-return-value. + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store + atomic/ + atomicrmw-no-return-value. + - s_waitcnt lgkmcnt(0) + must happen after + any preceding + local/generic + load/store/load + atomic/store + atomic/atomicrmw. + - Must happen before - Must happen before + the following the following + store. store. + - Ensures that all - Ensures that all + memory operations memory operations + to local have have + completed before completed before + performing the performing the + store that is being store that is being + released. released. + + 2. buffer/global/flat_store 2. buffer/global_store + store atomic release - workgroup - local 1. waitcnt vmcnt(0) & vscnt(0) + + - If CU wavefront execution mode, omit. + - If OpenCL, omit. + - Could be split into + separate s_waitcnt + vmcnt(0) and s_waitcnt + vscnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic load/load + atomic/ + atomicrmw-with-return-value. + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store atomic/ + atomicrmw-no-return-value. + - Must happen before + the following + store. + - Ensures that all + global memory + operations have + completed before + performing the + store that is being + released. + + 1. ds_store 2. ds_store + store atomic release - workgroup - generic 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & + vmcnt(0) & vscnt(0) + + - If CU wavefront execution mode, omit vmcnt and + vscnt. + - If OpenCL, omit. - If OpenCL, omit + lgkmcnt(0). + - Must happen after + any preceding + local/generic + load/store/load + atomic/store + atomic/atomicrmw. + - Could be split into + separate s_waitcnt + vmcnt(0), s_waitcnt + vscnt(0) and s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic load/load + atomic/ + atomicrmw-with-return-value. + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store + atomic/ + atomicrmw-no-return-value. + - s_waitcnt lgkmcnt(0) + must happen after + any preceding + local/generic load/store/load + atomic/store atomic/atomicrmw. + - Must happen before - Must happen before + the following the following + store. store. + - Ensures that all - Ensures that all + memory operations memory operations + to local have have + completed before completed before + performing the performing the + store that is being store that is being + released. released. + + 2. flat_store 2. flat_store + store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) & + - system - generic vmcnt(0) vmcnt(0) & vscnt(0) + + - If OpenCL, omit - If OpenCL, omit + lgkmcnt(0). lgkmcnt(0). + - Could be split into - Could be split into + separate s_waitcnt separate s_waitcnt + vmcnt(0) and vmcnt(0), s_waitcnt vscnt(0) + s_waitcnt and s_waitcnt + lgkmcnt(0) to allow lgkmcnt(0) to allow + them to be them to be + independently moved independently moved + according to the according to the + following rules. following rules. + - s_waitcnt vmcnt(0) - s_waitcnt vmcnt(0) + must happen after must happen after + any preceding any preceding + global/generic global/generic + load/store/load load/load + atomic/store atomic/ + atomic/atomicrmw. atomicrmw-with-return-value. + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store atomic/ + atomicrmw-no-return-value. + - s_waitcnt lgkmcnt(0) - s_waitcnt lgkmcnt(0) + must happen after must happen after + any preceding any preceding + local/generic local/generic + load/store/load load/store/load + atomic/store atomic/store + atomic/atomicrmw. atomic/atomicrmw. + - Must happen before - Must happen before + the following the following + store. store. + - Ensures that all - Ensures that all + memory operations memory operations + to memory have to memory have + completed before completed before + performing the performing the + store that is being store that is being + released. released. + + 2. buffer/global/ds/flat_store 2. buffer/global/ds/flat_store + atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic 1. buffer/global/ds/flat_atomic + - wavefront - local + - generic + atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & + vmcnt(0) & vscnt(0) + + - If CU wavefront execution mode, omit vmcnt and + vscnt. + - If OpenCL, omit. + + - Must happen after + any preceding + local/generic + load/store/load + atomic/store + atomic/atomicrmw. + - Could be split into + separate s_waitcnt + vmcnt(0), s_waitcnt + vscnt(0) and s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic load/load + atomic/ + atomicrmw-with-return-value. + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store + atomic/ + atomicrmw-no-return-value. + - s_waitcnt lgkmcnt(0) + must happen after + any preceding + local/generic + load/store/load + atomic/store + atomic/atomicrmw. + - Must happen before - Must happen before + the following the following + atomicrmw. atomicrmw. + - Ensures that all - Ensures that all + memory operations memory operations + to local have have + completed before completed before + performing the performing the + atomicrmw that is atomicrmw that is + being released. being released. + + 2. buffer/global/flat_atomic 2. buffer/global_atomic + atomicrmw release - workgroup - local 1. waitcnt vmcnt(0) & vscnt(0) + + - If CU wavefront execution mode, omit. + - If OpenCL, omit. + - Could be split into + separate s_waitcnt + vmcnt(0) and s_waitcnt + vscnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic load/load + atomic/ + atomicrmw-with-return-value. + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store atomic/ + atomicrmw-no-return-value. + - Must happen before + the following + store. + - Ensures that all + global memory + operations have + completed before + performing the + store that is being + released. + + 1. ds_atomic 2. ds_atomic + atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & + vmcnt(0) & vscnt(0) + + - If CU wavefront execution mode, omit vmcnt and + vscnt. + - If OpenCL, omit. - If OpenCL, omit + waitcnt lgkmcnt(0). + - Must happen after + any preceding + local/generic + load/store/load + atomic/store + atomic/atomicrmw. + - Could be split into + separate s_waitcnt + vmcnt(0), s_waitcnt + vscnt(0) and s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic load/load + atomic/ + atomicrmw-with-return-value. + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store + atomic/ + atomicrmw-no-return-value. + - s_waitcnt lgkmcnt(0) + must happen after + any preceding + local/generic load/store/load + atomic/store atomic/atomicrmw. + - Must happen before - Must happen before + the following the following + atomicrmw. atomicrmw. + - Ensures that all - Ensures that all + memory operations memory operations + to local have have + completed before completed before + performing the performing the + atomicrmw that is atomicrmw that is + being released. being released. + + 2. flat_atomic 2. flat_atomic + atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lkkmcnt(0) & + - system - generic vmcnt(0) vmcnt(0) & vscnt(0) + + - If OpenCL, omit - If OpenCL, omit + lgkmcnt(0). lgkmcnt(0). + - Could be split into - Could be split into + separate s_waitcnt separate s_waitcnt + vmcnt(0) and vmcnt(0), s_waitcnt + s_waitcnt vscnt(0) and s_waitcnt + lgkmcnt(0) to allow lgkmcnt(0) to allow + them to be them to be + independently moved independently moved + according to the according to the + following rules. following rules. + - s_waitcnt vmcnt(0) - s_waitcnt vmcnt(0) + must happen after must happen after + any preceding any preceding + global/generic global/generic + load/store/load load/load atomic/ + atomic/store atomicrmw-with-return-value. + atomic/atomicrmw. + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store atomic/ + atomicrmw-no-return-value. + - s_waitcnt lgkmcnt(0) - s_waitcnt lgkmcnt(0) + must happen after must happen after + any preceding any preceding + local/generic local/generic + load/store/load load/store/load + atomic/store atomic/store + atomic/atomicrmw. atomic/atomicrmw. + - Must happen before - Must happen before + the following the following + atomicrmw. atomicrmw. + - Ensures that all - Ensures that all + memory operations memory operations + to global and local to global and local + have completed have completed + before performing before performing + the atomicrmw that the atomicrmw that + is being released. is being released. + + 2. buffer/global/ds/flat_atomic 2. buffer/global/ds/flat_atomic + fence release - singlethread *none* *none* *none* + - wavefront + fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & + vmcnt(0) & vscnt(0) + + - If CU wavefront execution mode, omit vmcnt and + vscnt. + - If OpenCL and - If OpenCL and + address space is address space is + not generic, omit. not generic, omit + lgkmcnt(0). + - If OpenCL and + address space is + local, omit + vmcnt(0) and vscnt(0). + - However, since LLVM - However, since LLVM + currently has no currently has no + address space on address space on + the fence need to the fence need to + conservatively conservatively + always generate. If always generate. If + fence had an fence had an + address space then address space then + set to address set to address + space of OpenCL space of OpenCL + fence flag, or to fence flag, or to + generic if both generic if both + local and global local and global + flags are flags are + specified. specified. + - Must happen after + any preceding + local/generic + load/load + atomic/store/store + atomic/atomicrmw. + - Could be split into + separate s_waitcnt + vmcnt(0), s_waitcnt + vscnt(0) and s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic + load/load + atomic/ + atomicrmw-with-return-value. + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store atomic/ + atomicrmw-no-return-value. + - s_waitcnt lgkmcnt(0) + must happen after + any preceding + local/generic + load/store/load + atomic/store atomic/ + atomicrmw. + - Must happen before - Must happen before + any following store any following store + atomic/atomicrmw atomic/atomicrmw + with an equal or with an equal or + wider sync scope wider sync scope + and memory ordering and memory ordering + stronger than stronger than + unordered (this is unordered (this is + termed the termed the + fence-paired-atomic). fence-paired-atomic). + - Ensures that all - Ensures that all + memory operations memory operations + to local have have + completed before completed before + performing the performing the + following following + fence-paired-atomic. fence-paired-atomic. + + fence release - agent *none* 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) & + - system vmcnt(0) vmcnt(0) & vscnt(0) + + - If OpenCL and - If OpenCL and + address space is address space is + not generic, omit not generic, omit + lgkmcnt(0). lgkmcnt(0). + - If OpenCL and - If OpenCL and + address space is address space is + local, omit local, omit + vmcnt(0). vmcnt(0) and vscnt(0). + - However, since LLVM - However, since LLVM + currently has no currently has no + address space on address space on + the fence need to the fence need to + conservatively conservatively + always generate. If always generate. If + fence had an fence had an + address space then address space then + set to address set to address + space of OpenCL space of OpenCL + fence flag, or to fence flag, or to + generic if both generic if both + local and global local and global + flags are flags are + specified. specified. + - Could be split into - Could be split into + separate s_waitcnt separate s_waitcnt + vmcnt(0) and vmcnt(0), s_waitcnt + s_waitcnt vscnt(0) and s_waitcnt + lgkmcnt(0) to allow lgkmcnt(0) to allow + them to be them to be + independently moved independently moved + according to the according to the + following rules. following rules. + - s_waitcnt vmcnt(0) - s_waitcnt vmcnt(0) + must happen after must happen after + any preceding any preceding + global/generic global/generic + load/store/load load/load atomic/ + atomic/store atomicrmw-with-return-value. + atomic/atomicrmw. + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store atomic/ + atomicrmw-no-return-value. + - s_waitcnt lgkmcnt(0) - s_waitcnt lgkmcnt(0) + must happen after must happen after + any preceding any preceding + local/generic local/generic + load/store/load load/store/load + atomic/store atomic/store + atomic/atomicrmw. atomic/atomicrmw. + - Must happen before - Must happen before + any following store any following store + atomic/atomicrmw atomic/atomicrmw + with an equal or with an equal or + wider sync scope wider sync scope + and memory ordering and memory ordering + stronger than stronger than + unordered (this is unordered (this is + termed the termed the + fence-paired-atomic). fence-paired-atomic). + - Ensures that all - Ensures that all + memory operations memory operations + have have + completed before completed before + performing the performing the + following following + fence-paired-atomic. fence-paired-atomic. + + **Acquire-Release Atomic** + ---------------------------------------------------------------------------------------------------------------------- + atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic 1. buffer/global/ds/flat_atomic + - wavefront - local + - generic + atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & + vmcnt(0) & vscnt(0) + + - If CU wavefront execution mode, omit vmcnt and + vscnt. + - If OpenCL, omit. - If OpenCL, omit + s_waitcnt lgkmcnt(0). + - Must happen after - Must happen after + any preceding any preceding + local/generic local/generic + load/store/load load/store/load + atomic/store atomic/store + atomic/atomicrmw. atomic/atomicrmw. + - Could be split into + separate s_waitcnt + vmcnt(0), s_waitcnt + vscnt(0) and s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic load/load + atomic/ + atomicrmw-with-return-value. + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store + atomic/ + atomicrmw-no-return-value. + - s_waitcnt lgkmcnt(0) + must happen after + any preceding + local/generic load/store/load + atomic/store atomic/atomicrmw. + - Must happen before - Must happen before + the following the following + atomicrmw. atomicrmw. + - Ensures that all - Ensures that all + memory operations memory operations + to local have have + completed before completed before + performing the performing the + atomicrmw that is atomicrmw that is + being released. being released. + + 2. buffer/global/flat_atomic 2. buffer/global_atomic + 3. s_waitcnt vm/vscnt(0) + + - If CU wavefront execution mode, omit vm/vscnt. + - Use vmcnt if atomic with + return and vscnt if atomic + with no-return. + waitcnt lgkmcnt(0). + - Must happen before + the following + buffer_gl0_inv. + - Ensures any + following global + data read is no + older than the + atomicrmw value + being acquired. + + 4. buffer_gl0_inv + + - If CU wavefront execution mode, omit. + - Ensures that + following + loads will not see + stale data. + + atomicrmw acq_rel - workgroup - local 1. waitcnt vmcnt(0) & vscnt(0) + + - If CU wavefront execution mode, omit. + - If OpenCL, omit. + - Could be split into + separate s_waitcnt + vmcnt(0) and s_waitcnt + vscnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic load/load + atomic/ + atomicrmw-with-return-value. + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store atomic/ + atomicrmw-no-return-value. + - Must happen before + the following + store. + - Ensures that all + global memory + operations have + completed before + performing the + store that is being + released. + + 1. ds_atomic 2. ds_atomic + 2. s_waitcnt lgkmcnt(0) 3. s_waitcnt lgkmcnt(0) + + - If OpenCL, omit. - If OpenCL, omit. + - Must happen before - Must happen before + any following the following + global/generic buffer_gl0_inv. + load/load + atomic/store/store + atomic/atomicrmw. + - Ensures any - Ensures any + following global following global + data read is no data read is no + older than the load older than the load + atomic value being atomic value being + acquired. acquired. + + 4. buffer_gl0_inv + + - If CU wavefront execution mode, omit. + - If OpenCL omit. + - Ensures that + following + loads will not see + stale data. + + atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & + vmcnt(0) & vscnt(0) + + - If CU wavefront execution mode, omit vmcnt and + vscnt. + - If OpenCL, omit. - If OpenCL, omit + waitcnt lgkmcnt(0). + - Must happen after + any preceding + local/generic + load/store/load + atomic/store + atomic/atomicrmw. + - Could be split into + separate s_waitcnt + vmcnt(0), s_waitcnt + vscnt(0) and s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic load/load + atomic/ + atomicrmw-with-return-value. + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store + atomic/ + atomicrmw-no-return-value. + - s_waitcnt lgkmcnt(0) + must happen after + any preceding + local/generic load/store/load + atomic/store atomic/atomicrmw. + - Must happen before - Must happen before + the following the following + atomicrmw. atomicrmw. + - Ensures that all - Ensures that all + memory operations memory operations + to local have have + completed before completed before + performing the performing the + atomicrmw that is atomicrmw that is + being released. being released. + + 2. flat_atomic 2. flat_atomic + 3. s_waitcnt lgkmcnt(0) 3. s_waitcnt lgkmcnt(0) & + vm/vscnt(0) + + - If CU wavefront execution mode, omit vm/vscnt. + - If OpenCL, omit. - If OpenCL, omit + waitcnt lgkmcnt(0). + - Must happen before - Must happen before + any following the following + global/generic buffer_gl0_inv. + load/load + atomic/store/store + atomic/atomicrmw. + - Ensures any - Ensures any + following global following global + data read is no data read is no + older than the load older than the load + atomic value being atomic value being + acquired. acquired. + + 3. buffer_gl0_inv + + - If CU wavefront execution mode, omit. + - Ensures that + following + loads will not see + stale data. + + atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) & + - system vmcnt(0) vmcnt(0) & vscnt(0) + + - If OpenCL, omit - If OpenCL, omit + lgkmcnt(0). lgkmcnt(0). + - Could be split into - Could be split into + separate s_waitcnt separate s_waitcnt + vmcnt(0) and vmcnt(0), s_waitcnt + s_waitcnt vscnt(0) and s_waitcnt + lgkmcnt(0) to allow lgkmcnt(0) to allow + them to be them to be + independently moved independently moved + according to the according to the + following rules. following rules. + - s_waitcnt vmcnt(0) - s_waitcnt vmcnt(0) + must happen after must happen after + any preceding any preceding + global/generic global/generic + load/store/load load/load atomic/ + atomic/store atomicrmw-with-return-value. + atomic/atomicrmw. + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store atomic/ + atomicrmw-no-return-value. + - s_waitcnt lgkmcnt(0) - s_waitcnt lgkmcnt(0) + must happen after must happen after + any preceding any preceding + local/generic local/generic + load/store/load load/store/load + atomic/store atomic/store + atomic/atomicrmw. atomic/atomicrmw. + - Must happen before - Must happen before + the following the following + atomicrmw. atomicrmw. + - Ensures that all - Ensures that all + memory operations memory operations + to global have to global have + completed before completed before + performing the performing the + atomicrmw that is atomicrmw that is + being released. being released. + + 2. buffer/global/flat_atomic 2. buffer/global_atomic + 3. s_waitcnt vmcnt(0) 3. s_waitcnt vm/vscnt(0) + + - Use vmcnt if atomic with + return and vscnt if atomic + with no-return. + waitcnt lgkmcnt(0). + - Must happen before - Must happen before + following following + buffer_wbinvl1_vol. buffer_gl*_inv. + - Ensures the - Ensures the + atomicrmw has atomicrmw has + completed before completed before + invalidating the invalidating the + cache. caches. + + 4. buffer_wbinvl1_vol 4. buffer_gl0_inv; + buffer_gl1_inv + + - Must happen before - Must happen before + any following any following + global/generic global/generic + load/load load/load + atomic/atomicrmw. atomic/atomicrmw. + - Ensures that - Ensures that + following loads following loads + will not see stale will not see stale + global data. global data. + + atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) & + - system vmcnt(0) vmcnt(0) & vscnt(0) + + - If OpenCL, omit - If OpenCL, omit + lgkmcnt(0). lgkmcnt(0). + - Could be split into - Could be split into + separate s_waitcnt separate s_waitcnt + vmcnt(0) and vmcnt(0), s_waitcnt + s_waitcnt vscnt(0) and s_waitcnt + lgkmcnt(0) to allow lgkmcnt(0) to allow + them to be them to be + independently moved independently moved + according to the according to the + following rules. following rules. + - s_waitcnt vmcnt(0) - s_waitcnt vmcnt(0) + must happen after must happen after + any preceding any preceding + global/generic global/generic + load/store/load load/load atomic + atomic/store atomicrmw-with-return-value. + atomic/atomicrmw. + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store atomic/ + atomicrmw-no-return-value. + - s_waitcnt lgkmcnt(0) - s_waitcnt lgkmcnt(0) + must happen after must happen after + any preceding any preceding + local/generic local/generic + load/store/load load/store/load + atomic/store atomic/store + atomic/atomicrmw. atomic/atomicrmw. + - Must happen before - Must happen before + the following the following + atomicrmw. atomicrmw. + - Ensures that all - Ensures that all + memory operations memory operations + to global have have + completed before completed before + performing the performing the + atomicrmw that is atomicrmw that is + being released. being released. + + 2. flat_atomic 2. flat_atomic + 3. s_waitcnt vmcnt(0) & 3. s_waitcnt vm/vscnt(0) & + lgkmcnt(0) lgkmcnt(0) + + - If OpenCL, omit - If OpenCL, omit + lgkmcnt(0). lgkmcnt(0). + - Use vmcnt if atomic with + return and vscnt if atomic + with no-return. + - Must happen before - Must happen before + following following + buffer_wbinvl1_vol. buffer_gl*_inv. + - Ensures the - Ensures the + atomicrmw has atomicrmw has + completed before completed before + invalidating the invalidating the + cache. caches. + + 4. buffer_wbinvl1_vol 4. buffer_gl0_inv; + buffer_gl1_inv + + - Must happen before - Must happen before + any following any following + global/generic global/generic + load/load load/load + atomic/atomicrmw. atomic/atomicrmw. + - Ensures that - Ensures that + following loads following loads + will not see stale will not see stale + global data. global data. + + fence acq_rel - singlethread *none* *none* *none* + - wavefront + fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & + vmcnt(0) & vscnt(0) + + - If CU wavefront execution mode, omit vmcnt and + vscnt. + - If OpenCL and - If OpenCL and + address space is address space is + not generic, omit. not generic, omit + lgkmcnt(0). + - If OpenCL and + address space is + local, omit + vmcnt(0) and vscnt(0). + - However, - However, + since LLVM since LLVM + currently has no currently has no + address space on address space on + the fence need to the fence need to + conservatively conservatively + always generate always generate + (see comment for (see comment for + previous fence). previous fence). + - Must happen after + any preceding + local/generic + load/load + atomic/store/store + atomic/atomicrmw. + - Could be split into + separate s_waitcnt + vmcnt(0), s_waitcnt + vscnt(0) and s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - s_waitcnt vmcnt(0) + must happen after + any preceding + global/generic + load/load + atomic/ + atomicrmw-with-return-value. + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store atomic/ + atomicrmw-no-return-value. + - s_waitcnt lgkmcnt(0) + must happen after + any preceding + local/generic + load/store/load + atomic/store atomic/ + atomicrmw. + - Must happen before - Must happen before + any following any following + global/generic global/generic + load/load load/load + atomic/store/store atomic/store/store + atomic/atomicrmw. atomic/atomicrmw. + - Ensures that all - Ensures that all + memory operations memory operations + to local have have + completed before completed before + performing any performing any + following global following global + memory operations. memory operations. + - Ensures that the - Ensures that the + preceding preceding + local/generic load local/generic load + atomic/atomicrmw atomic/atomicrmw + with an equal or with an equal or + wider sync scope wider sync scope + and memory ordering and memory ordering + stronger than stronger than + unordered (this is unordered (this is + termed the termed the + acquire-fence-paired-atomic acquire-fence-paired-atomic + ) has completed ) has completed + before following before following + global memory global memory + operations. This operations. This + satisfies the satisfies the + requirements of requirements of + acquire. acquire. + - Ensures that all - Ensures that all + previous memory previous memory + operations have operations have + completed before a completed before a + following following + local/generic store local/generic store + atomic/atomicrmw atomic/atomicrmw + with an equal or with an equal or + wider sync scope wider sync scope + and memory ordering and memory ordering + stronger than stronger than + unordered (this is unordered (this is + termed the termed the + release-fence-paired-atomic release-fence-paired-atomic + ). This satisfies the ). This satisfies the + requirements of requirements of + release. release. + - Must happen before + the following + buffer_gl0_inv. + - Ensures that the + acquire-fence-paired + atomic has completed + before invalidating + the + cache. Therefore + any following + locations read must + be no older than + the value read by + the + acquire-fence-paired-atomic. + + 3. buffer_gl0_inv + + - If CU wavefront execution mode, omit. + - Ensures that + following + loads will not see + stale data. + + fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) & + - system vmcnt(0) vmcnt(0) & vscnt(0) + + - If OpenCL and - If OpenCL and + address space is address space is + not generic, omit not generic, omit + lgkmcnt(0). lgkmcnt(0). + - If OpenCL and + address space is + local, omit + vmcnt(0) and vscnt(0). + - However, since LLVM - However, since LLVM + currently has no currently has no + address space on address space on + the fence need to the fence need to + conservatively conservatively + always generate always generate + (see comment for (see comment for + previous fence). previous fence). + - Could be split into - Could be split into + separate s_waitcnt separate s_waitcnt + vmcnt(0) and vmcnt(0), s_waitcnt + s_waitcnt vscnt(0) and s_waitcnt + lgkmcnt(0) to allow lgkmcnt(0) to allow + them to be them to be + independently moved independently moved + according to the according to the + following rules. following rules. + - s_waitcnt vmcnt(0) - s_waitcnt vmcnt(0) + must happen after must happen after + any preceding any preceding + global/generic global/generic + load/store/load load/load + atomic/store atomic/ + atomic/atomicrmw. atomicrmw-with-return-value. + - s_waitcnt vscnt(0) + must happen after + any preceding + global/generic + store/store atomic/ + atomicrmw-no-return-value. + - s_waitcnt lgkmcnt(0) - s_waitcnt lgkmcnt(0) + must happen after must happen after + any preceding any preceding + local/generic local/generic + load/store/load load/store/load + atomic/store atomic/store + atomic/atomicrmw. atomic/atomicrmw. + - Must happen before - Must happen before + the following the following + buffer_wbinvl1_vol. buffer_gl*_inv. + - Ensures that the - Ensures that the + preceding preceding + global/local/generic global/local/generic + load load + atomic/atomicrmw atomic/atomicrmw + with an equal or with an equal or + wider sync scope wider sync scope + and memory ordering and memory ordering + stronger than stronger than + unordered (this is unordered (this is + termed the termed the + acquire-fence-paired-atomic acquire-fence-paired-atomic + ) has completed ) has completed + before invalidating before invalidating + the cache. This the caches. This + satisfies the satisfies the + requirements of requirements of + acquire. acquire. + - Ensures that all - Ensures that all + previous memory previous memory + operations have operations have + completed before a completed before a + following following + global/local/generic global/local/generic + store store + atomic/atomicrmw atomic/atomicrmw + with an equal or with an equal or + wider sync scope wider sync scope + and memory ordering and memory ordering + stronger than stronger than + unordered (this is unordered (this is + termed the termed the + release-fence-paired-atomic release-fence-paired-atomic + ). This satisfies the ). This satisfies the + requirements of requirements of + release. release. + + 2. buffer_wbinvl1_vol 2. buffer_gl0_inv; + buffer_gl1_inv + + - Must happen before - Must happen before + any following any following + global/generic global/generic + load/load load/load + atomic/store/store atomic/store/store + atomic/atomicrmw. atomic/atomicrmw. + - Ensures that - Ensures that + following loads following loads + will not see stale will not see stale + global data. This global data. This + satisfies the satisfies the + requirements of requirements of + acquire. acquire. + + **Sequential Consistent Atomic** + ---------------------------------------------------------------------------------------------------------------------- + load atomic seq_cst - singlethread - global *Same as corresponding *Same as corresponding + - wavefront - local load atomic acquire, load atomic acquire, + - generic except must generated except must generated + all instructions even all instructions even + for OpenCL.* for OpenCL.* + load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0) 1. s_waitcnt lgkmcnt(0) & + - generic vmcnt(0) & vscnt(0) + + - If CU wavefront execution mode, omit vmcnt and + vscnt. + - Could be split into + separate s_waitcnt + vmcnt(0), s_waitcnt + vscnt(0) and s_waitcnt + lgkmcnt(0) to allow + them to be + independently moved + according to the + following rules. + - Must - waitcnt lgkmcnt(0) must + happen after happen after + preceding preceding + global/generic load local load + atomic/store atomic/store + atomic/atomicrmw atomic/atomicrmw + with memory with memory + ordering of seq_cst ordering of seq_cst + and with equal or and with equal or + wider sync scope. wider sync scope. + (Note that seq_cst (Note that seq_cst + fences have their fences have their + own s_waitcnt own s_waitcnt + lgkmcnt(0) and so do lgkmcnt(0) and so do + not need to be not need to be + considered.) considered.) + - waitcnt vmcnt(0) + Must happen after + preceding + global/generic load + atomic/ + atomicrmw-with-return-value + with memory + ordering of seq_cst + and with equal or + wider sync scope. + (Note that seq_cst + fences have their + own s_waitcnt + vmcnt(0) and so do + not need to be + considered.) + - waitcnt vscnt(0) + Must happen after + preceding + global/generic store + atomic/ + atomicrmw-no-return-value + with memory + ordering of seq_cst + and with equal or + wider sync scope. + (Note that seq_cst + fences have their + own s_waitcnt + vscnt(0) and so do + not need to be + considered.) + - Ensures any - Ensures any + preceding preceding + sequential sequential + consistent local consistent global/local + memory instructions memory instructions + have completed have completed + before executing before executing + this sequentially this sequentially + consistent consistent + instruction. This instruction. This + prevents reordering prevents reordering + a seq_cst store a seq_cst store + followed by a followed by a + seq_cst load. (Note seq_cst load. (Note + that seq_cst is that seq_cst is + stronger than stronger than + acquire/release as acquire/release as + the reordering of the reordering of + load acquire load acquire + followed by a store followed by a store + release is release is + prevented by the prevented by the + waitcnt of waitcnt of + the release, but the release, but + there is nothing there is nothing + preventing a store preventing a store + release followed by release followed by + load acquire from load acquire from + competing out of competing out of + order.) order.) + + 2. *Following 2. *Following + instructions same as instructions same as + corresponding load corresponding load + atomic acquire, atomic acquire, + except must generated except must generated + all instructions even all instructions even + for OpenCL.* for OpenCL.* + load atomic seq_cst - workgroup - local *Same as corresponding + load atomic acquire, + except must generated + all instructions even + for OpenCL.* + + 1. s_waitcnt vmcnt(0) & vscnt(0) + + - If CU wavefront execution mode, omit. + - Could be split into + separate s_waitcnt + vmcnt(0) and s_waitcnt + vscnt(0) to allow + them to be + independently moved + according to the + following rules. + - waitcnt vmcnt(0) + Must happen after + preceding + global/generic load + atomic/ + atomicrmw-with-return-value + with memory + ordering of seq_cst + and with equal or + wider sync scope. + (Note that seq_cst + fences have their + own s_waitcnt + vmcnt(0) and so do + not need to be + considered.) + - waitcnt vscnt(0) + Must happen after + preceding + global/generic store + atomic/ + atomicrmw-no-return-value + with memory + ordering of seq_cst + and with equal or + wider sync scope. + (Note that seq_cst + fences have their + own s_waitcnt + vscnt(0) and so do + not need to be + considered.) + - Ensures any + preceding + sequential + consistent global + memory instructions + have completed + before executing + this sequentially + consistent + instruction. This + prevents reordering + a seq_cst store + followed by a + seq_cst load. (Note + that seq_cst is + stronger than + acquire/release as + the reordering of + load acquire + followed by a store + release is + prevented by the + waitcnt of + the release, but + there is nothing + preventing a store + release followed by + load acquire from + competing out of + order.) + + 2. *Following + instructions same as + corresponding load + atomic acquire, + except must generated + all instructions even + for OpenCL.* + + load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) & 1. s_waitcnt lgkmcnt(0) & + - system - generic vmcnt(0) vmcnt(0) & vscnt(0) + + - Could be split into - Could be split into + separate s_waitcnt separate s_waitcnt + vmcnt(0) vmcnt(0), s_waitcnt + and s_waitcnt vscnt(0) and s_waitcnt + lgkmcnt(0) to allow lgkmcnt(0) to allow + them to be them to be + independently moved independently moved + according to the according to the + following rules. following rules. + - waitcnt lgkmcnt(0) - waitcnt lgkmcnt(0) + must happen after must happen after + preceding preceding + global/generic load local load + atomic/store atomic/store + atomic/atomicrmw atomic/atomicrmw + with memory with memory + ordering of seq_cst ordering of seq_cst + and with equal or and with equal or + wider sync scope. wider sync scope. + (Note that seq_cst (Note that seq_cst + fences have their fences have their + own s_waitcnt own s_waitcnt + lgkmcnt(0) and so do lgkmcnt(0) and so do + not need to be not need to be + considered.) considered.) + - waitcnt vmcnt(0) - waitcnt vmcnt(0) + must happen after must happen after + preceding preceding + global/generic load global/generic load + atomic/store atomic/ + atomic/atomicrmw atomicrmw-with-return-value + with memory with memory + ordering of seq_cst ordering of seq_cst + and with equal or and with equal or + wider sync scope. wider sync scope. + (Note that seq_cst (Note that seq_cst + fences have their fences have their + own s_waitcnt own s_waitcnt + vmcnt(0) and so do vmcnt(0) and so do + not need to be not need to be + considered.) considered.) + - waitcnt vscnt(0) + Must happen after + preceding + global/generic store + atomic/ + atomicrmw-no-return-value + with memory + ordering of seq_cst + and with equal or + wider sync scope. + (Note that seq_cst + fences have their + own s_waitcnt + vscnt(0) and so do + not need to be + considered.) + - Ensures any - Ensures any + preceding preceding + sequential sequential + consistent global consistent global + memory instructions memory instructions + have completed have completed + before executing before executing + this sequentially this sequentially + consistent consistent + instruction. This instruction. This + prevents reordering prevents reordering + a seq_cst store a seq_cst store + followed by a followed by a + seq_cst load. (Note seq_cst load. (Note + that seq_cst is that seq_cst is + stronger than stronger than + acquire/release as acquire/release as + the reordering of the reordering of + load acquire load acquire + followed by a store followed by a store + release is release is + prevented by the prevented by the + waitcnt of waitcnt of + the release, but the release, but + there is nothing there is nothing + preventing a store preventing a store + release followed by release followed by + load acquire from load acquire from + competing out of competing out of + order.) order.) + + 2. *Following 2. *Following + instructions same as instructions same as + corresponding load corresponding load + atomic acquire, atomic acquire, + except must generated except must generated + all instructions even all instructions even + for OpenCL.* for OpenCL.* + store atomic seq_cst - singlethread - global *Same as corresponding *Same as corresponding + - wavefront - local store atomic release, store atomic release, + - workgroup - generic except must generated except must generated + all instructions even all instructions even + for OpenCL.* for OpenCL.* + store atomic seq_cst - agent - global *Same as corresponding *Same as corresponding + - system - generic store atomic release, store atomic release, + except must generated except must generated + all instructions even all instructions even + for OpenCL.* for OpenCL.* + atomicrmw seq_cst - singlethread - global *Same as corresponding *Same as corresponding + - wavefront - local atomicrmw acq_rel, atomicrmw acq_rel, + - workgroup - generic except must generated except must generated + all instructions even all instructions even + for OpenCL.* for OpenCL.* + atomicrmw seq_cst - agent - global *Same as corresponding *Same as corresponding + - system - generic atomicrmw acq_rel, atomicrmw acq_rel, + except must generated except must generated + all instructions even all instructions even + for OpenCL.* for OpenCL.* + fence seq_cst - singlethread *none* *Same as corresponding *Same as corresponding + - wavefront fence acq_rel, fence acq_rel, + - workgroup except must generated except must generated + - agent all instructions even all instructions even + - system for OpenCL.* for OpenCL.* + ============ ============ ============== ========== =============================== ================================== + +The memory order also adds the single thread optimization constrains defined in +table +:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx10-table`. + + .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX10 + :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx10-table + + ============ ============================================================== + LLVM Memory Optimization Constraints + Ordering + ============ ============================================================== + unordered *none* + monotonic *none* + acquire - If a load atomic/atomicrmw then no following load/load + atomic/store/ store atomic/atomicrmw/fence instruction can + be moved before the acquire. + - If a fence then same as load atomic, plus no preceding + associated fence-paired-atomic can be moved after the fence. + release - If a store atomic/atomicrmw then no preceding load/load + atomic/store/ store atomic/atomicrmw/fence instruction can + be moved after the release. + - If a fence then same as store atomic, plus no following + associated fence-paired-atomic can be moved before the + fence. + acq_rel Same constraints as both acquire and release. + seq_cst - If a load atomic then same constraints as acquire, plus no + preceding sequentially consistent load atomic/store + atomic/atomicrmw/fence instruction can be moved after the + seq_cst. + - If a store atomic then the same constraints as release, plus + no following sequentially consistent load atomic/store + atomic/atomicrmw/fence instruction can be moved before the + seq_cst. + - If an atomicrmw/fence then same constraints as acq_rel. + ============ ============================================================== + +Trap Handler ABI +~~~~~~~~~~~~~~~~ + +For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes +(such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports +the ``s_trap`` instruction with the following usage: + + .. table:: AMDGPU Trap Handler for AMDHSA OS + :name: amdgpu-trap-handler-for-amdhsa-os-table + + =================== =============== =============== ======================= + Usage Code Sequence Trap Handler Description + Inputs + =================== =============== =============== ======================= + reserved ``s_trap 0x00`` Reserved by hardware. + ``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for HSA + ``queue_ptr`` ``debugtrap`` + ``VGPR0``: intrinsic (not + ``arg`` implemented). + ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes dispatch to be + ``queue_ptr`` terminated and its + associated queue put + into the error state. + ``llvm.debugtrap`` ``s_trap 0x03`` - If debugger not + installed then + behaves as a + no-operation. The + trap handler is + entered and + immediately returns + to continue + execution of the + wavefront. + - If the debugger is + installed, causes + the debug trap to be + reported by the + debugger and the + wavefront is put in + the halt state until + resumed by the + debugger. + reserved ``s_trap 0x04`` Reserved. + reserved ``s_trap 0x05`` Reserved. + reserved ``s_trap 0x06`` Reserved. + debugger breakpoint ``s_trap 0x07`` Reserved for debugger + breakpoints. + reserved ``s_trap 0x08`` Reserved. + reserved ``s_trap 0xfe`` Reserved. + reserved ``s_trap 0xff`` Reserved. + =================== =============== =============== ======================= + +.. _amdgpu-amdhsa-function-call-convention: + +Call Convention +~~~~~~~~~~~~~~~ + +.. note:: + + This section is currently incomplete and has inaccuracies. It is WIP that will + be updated as information is determined. + +See :ref:`amdgpu-dwarf-address-space-identifier` for information on swizzled +addresses. Unswizzled addresses are normal linear addresses. + +.. _amdgpu-amdhsa-function-call-convention-kernel-functions: + +Kernel Functions +++++++++++++++++ + +This section describes the call convention ABI for the outer kernel function. + +See :ref:`amdgpu-amdhsa-initial-kernel-execution-state` for the kernel call +convention. + +The following is not part of the AMDGPU kernel calling convention but describes +how the AMDGPU implements function calls: + +1. Clang decides the kernarg layout to match the *HSA Programmer's Language + Reference* [HSA]_. + + - All structs are passed directly. + - Lambda values are passed *TBA*. + + .. TODO:: + + - Does this really follow HSA rules? Or are structs >16 bytes passed + by-value struct? + - What is ABI for lambda values? + +4. The kernel performs certain setup in its prolog, as described in + :ref:`amdgpu-amdhsa-kernel-prolog`. + +.. _amdgpu-amdhsa-function-call-convention-non-kernel-functions: + +Non-Kernel Functions +++++++++++++++++++++ + +This section describes the call convention ABI for functions other than the +outer kernel function. + +If a kernel has function calls then scratch is always allocated and used for +the call stack which grows from low address to high address using the swizzled +scratch address space. + +On entry to a function: + +1. SGPR0-3 contain a V# with the following properties (see + :ref:`amdgpu-amdhsa-kernel-prolog-private-segment-buffer`): + + * Base address pointing to the beginning of the wavefront scratch backing + memory. + * Swizzled with dword element size and stride of wavefront size elements. + +2. The FLAT_SCRATCH register pair is setup. See + :ref:`amdgpu-amdhsa-kernel-prolog-flat-scratch`. +3. GFX6-8: M0 register set to the size of LDS in bytes. See + :ref:`amdgpu-amdhsa-kernel-prolog-m0`. +4. The EXEC register is set to the lanes active on entry to the function. +5. MODE register: *TBD* +6. VGPR0-31 and SGPR4-29 are used to pass function input arguments as described + below. +7. SGPR30-31 return address (RA). The code address that the function must + return to when it completes. The value is undefined if the function is *no + return*. +8. SGPR32 is used for the stack pointer (SP). It is an unswizzled scratch + offset relative to the beginning of the wavefront scratch backing memory. + + The unswizzled SP can be used with buffer instructions as an unswizzled SGPR + offset with the scratch V# in SGPR0-3 to access the stack in a swizzled + manner. + + The unswizzled SP value can be converted into the swizzled SP value by: + + | swizzled SP = unswizzled SP / wavefront size + + This may be used to obtain the private address space address of stack + objects and to convert this address to a flat address by adding the flat + scratch aperture base address. + + The swizzled SP value is always 4 bytes aligned for the ``r600`` + architecture and 16 byte aligned for the ``amdgcn`` architecture. + + .. note:: + + The ``amdgcn`` value is selected to avoid dynamic stack alignment for the + OpenCL language which has the largest base type defined as 16 bytes. + + On entry, the swizzled SP value is the address of the first function + argument passed on the stack. Other stack passed arguments are positive + offsets from the entry swizzled SP value. + + The function may use positive offsets beyond the last stack passed argument + for stack allocated local variables and register spill slots. If necessary, + the function may align these to greater alignment than 16 bytes. After these + the function may dynamically allocate space for such things as runtime sized + ``alloca`` local allocations. + + If the function calls another function, it will place any stack allocated + arguments after the last local allocation and adjust SGPR32 to the address + after the last local allocation. + +9. All other registers are unspecified. +10. Any necessary ``waitcnt`` has been performed to ensure memory is available + to the function. + +On exit from a function: + +1. VGPR0-31 and SGPR4-29 are used to pass function result arguments as + described below. Any registers used are considered clobbered registers. +2. The following registers are preserved and have the same value as on entry: + + * FLAT_SCRATCH + * EXEC + * GFX6-8: M0 + * All SGPR registers except the clobbered registers of SGPR4-31. + * VGPR40-47 + VGPR56-63 + VGPR72-79 + VGPR88-95 + VGPR104-111 + VGPR120-127 + VGPR136-143 + VGPR152-159 + VGPR168-175 + VGPR184-191 + VGPR200-207 + VGPR216-223 + VGPR232-239 + VGPR248-255 + + *Except the argument registers, the VGPR clobbered and the preserved + registers are intermixed at regular intervals in order to + get a better occupancy.* + + For the AMDGPU backend, an inter-procedural register allocation (IPRA) + optimization may mark some of clobbered SGPR and VGPR registers as + preserved if it can be determined that the called function does not change + their value. + +2. The PC is set to the RA provided on entry. +3. MODE register: *TBD*. +4. All other registers are clobbered. +5. Any necessary ``waitcnt`` has been performed to ensure memory accessed by + function is available to the caller. + +.. TODO:: + + - On gfx908 are all ACC registers clobbered? + + - How are function results returned? The address of structured types is passed + by reference, but what about other types? + +The function input arguments are made up of the formal arguments explicitly +declared by the source language function plus the implicit input arguments used +by the implementation. + +The source language input arguments are: + +1. Any source language implicit ``this`` or ``self`` argument comes first as a + pointer type. +2. Followed by the function formal arguments in left to right source order. + +The source language result arguments are: + +1. The function result argument. + +The source language input or result struct type arguments that are less than or +equal to 16 bytes, are decomposed recursively into their base type fields, and +each field is passed as if a separate argument. For input arguments, if the +called function requires the struct to be in memory, for example because its +address is taken, then the function body is responsible for allocating a stack +location and copying the field arguments into it. Clang terms this *direct +struct*. + +The source language input struct type arguments that are greater than 16 bytes, +are passed by reference. The caller is responsible for allocating a stack +location to make a copy of the struct value and pass the address as the input +argument. The called function is responsible to perform the dereference when +accessing the input argument. Clang terms this *by-value struct*. + +A source language result struct type argument that is greater than 16 bytes, is +returned by reference. The caller is responsible for allocating a stack location +to hold the result value and passes the address as the last input argument +(before the implicit input arguments). In this case there are no result +arguments. The called function is responsible to perform the dereference when +storing the result value. Clang terms this *structured return (sret)*. + +*TODO: correct the ``sret`` definition.* + +.. TODO:: + + Is this definition correct? Or is ``sret`` only used if passing in registers, and + pass as non-decomposed struct as stack argument? Or something else? Is the + memory location in the caller stack frame, or a stack memory argument and so + no address is passed as the caller can directly write to the argument stack + location? But then the stack location is still live after return. If an + argument stack location is it the first stack argument or the last one? + +Lambda argument types are treated as struct types with an implementation defined +set of fields. + +.. TODO:: + + Need to specify the ABI for lambda types for AMDGPU. + +For AMDGPU backend all source language arguments (including the decomposed +struct type arguments) are passed in VGPRs unless marked ``inreg`` in which case +they are passed in SGPRs. + +The AMDGPU backend walks the function call graph from the leaves to determine +which implicit input arguments are used, propagating to each caller of the +function. The used implicit arguments are appended to the function arguments +after the source language arguments in the following order: + +.. TODO:: + + Is recursion or external functions supported? + +1. Work-Item ID (1 VGPR) + + The X, Y and Z work-item ID are packed into a single VGRP with the following + layout. Only fields actually used by the function are set. The other bits + are undefined. + + The values come from the initial kernel execution state. See + :ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`. + + .. table:: Work-item implicit argument layout + :name: amdgpu-amdhsa-workitem-implicit-argument-layout-table + + ======= ======= ============== + Bits Size Field Name + ======= ======= ============== + 9:0 10 bits X Work-Item ID + 19:10 10 bits Y Work-Item ID + 29:20 10 bits Z Work-Item ID + 31:30 2 bits Unused + ======= ======= ============== + +2. Dispatch Ptr (2 SGPRs) + + The value comes from the initial kernel execution state. See + :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`. + +3. Queue Ptr (2 SGPRs) + + The value comes from the initial kernel execution state. See + :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`. + +4. Kernarg Segment Ptr (2 SGPRs) + + The value comes from the initial kernel execution state. See + :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`. + +5. Dispatch id (2 SGPRs) + + The value comes from the initial kernel execution state. See + :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`. + +6. Work-Group ID X (1 SGPR) + + The value comes from the initial kernel execution state. See + :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`. + +7. Work-Group ID Y (1 SGPR) + + The value comes from the initial kernel execution state. See + :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`. + +8. Work-Group ID Z (1 SGPR) + + The value comes from the initial kernel execution state. See + :ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`. + +9. Implicit Argument Ptr (2 SGPRs) + + The value is computed by adding an offset to Kernarg Segment Ptr to get the + global address space pointer to the first kernarg implicit argument. + +The input and result arguments are assigned in order in the following manner: + +.. note:: + + There are likely some errors and omissions in the following description that + need correction. + + .. TODO:: + + Check the clang source code to decipher how function arguments and return + results are handled. Also see the AMDGPU specific values used. + +* VGPR arguments are assigned to consecutive VGPRs starting at VGPR0 up to + VGPR31. + + If there are more arguments than will fit in these registers, the remaining + arguments are allocated on the stack in order on naturally aligned + addresses. + + .. TODO:: + + How are overly aligned structures allocated on the stack? + +* SGPR arguments are assigned to consecutive SGPRs starting at SGPR0 up to + SGPR29. + + If there are more arguments than will fit in these registers, the remaining + arguments are allocated on the stack in order on naturally aligned + addresses. + +Note that decomposed struct type arguments may have some fields passed in +registers and some in memory. + +.. TODO:: + + So, a struct which can pass some fields as decomposed register arguments, will + pass the rest as decomposed stack elements? But an argument that will not start + in registers will not be decomposed and will be passed as a non-decomposed + stack value? + +The following is not part of the AMDGPU function calling convention but +describes how the AMDGPU implements function calls: + +1. SGPR33 is used as a frame pointer (FP) if necessary. Like the SP it is an + unswizzled scratch address. It is only needed if runtime sized ``alloca`` + are used, or for the reasons defined in ``SIFrameLowering``. +2. Runtime stack alignment is supported. SGPR34 is used as a base pointer (BP) + to access the incoming stack arguments in the function. The BP is needed + only when the function requires the runtime stack alignment. + +3. Allocating SGPR arguments on the stack are not supported. + +4. No CFI is currently generated. See + :ref:`amdgpu-dwarf-call-frame-information`. + + .. note:: + + CFI will be generated that defines the CFA as the unswizzled address + relative to the wave scratch base in the unswizzled private address space + of the lowest address stack allocated local variable. + + ``DW_AT_frame_base`` will be defined as the swizzled address in the + swizzled private address space by dividing the CFA by the wavefront size + (since CFA is always at least dword aligned which matches the scratch + swizzle element size). + + If no dynamic stack alignment was performed, the stack allocated arguments + are accessed as negative offsets relative to ``DW_AT_frame_base``, and the + local variables and register spill slots are accessed as positive offsets + relative to ``DW_AT_frame_base``. + +5. Function argument passing is implemented by copying the input physical + registers to virtual registers on entry. The register allocator can spill if + necessary. These are copied back to physical registers at call sites. The + net effect is that each function call can have these values in entirely + distinct locations. The IPRA can help avoid shuffling argument registers. +6. Call sites are implemented by setting up the arguments at positive offsets + from SP. Then SP is incremented to account for the known frame size before + the call and decremented after the call. + + .. note:: + + The CFI will reflect the changed calculation needed to compute the CFA + from SP. + +7. 4 byte spill slots are used in the stack frame. One slot is allocated for an + emergency spill slot. Buffer instructions are used for stack accesses and + not the ``flat_scratch`` instruction. + + .. TODO:: + + Explain when the emergency spill slot is used. + +.. TODO:: + + Possible broken issues: + + - Stack arguments must be aligned to required alignment. + - Stack is aligned to max(16, max formal argument alignment) + - Direct argument < 64 bits should check register budget. + - Register budget calculation should respect ``inreg`` for SGPR. + - SGPR overflow is not handled. + - struct with 1 member unpeeling is not checking size of member. + - ``sret`` is after ``this`` pointer. + - Caller is not implementing stack realignment: need an extra pointer. + - Should say AMDGPU passes FP rather than SP. + - Should CFI define CFA as address of locals or arguments. Difference is + apparent when have implemented dynamic alignment. + - If ``SCRATCH`` instruction could allow negative offsets, then can make FP be + highest address of stack frame and use negative offset for locals. Would + allow SP to be the same as FP and could support signal-handler-like as now + have a real SP for the top of the stack. + - How is ``sret`` passed on the stack? In argument stack area? Can it overlay + arguments? + +AMDPAL +------ + +This section provides code conventions used when the target triple OS is +``amdpal`` (see :ref:`amdgpu-target-triples`) for passing runtime parameters +from the application/runtime to each invocation of a hardware shader. These +parameters include both generic, application-controlled parameters called +*user data* as well as system-generated parameters that are a product of the +draw or dispatch execution. + +User Data +~~~~~~~~~ + +Each hardware stage has a set of 32-bit *user data registers* which can be +written from a command buffer and then loaded into SGPRs when waves are launched +via a subsequent dispatch or draw operation. This is the way most arguments are +passed from the application/runtime to a hardware shader. + +Compute User Data +~~~~~~~~~~~~~~~~~ + +Compute shader user data mappings are simpler than graphics shaders and have a +fixed mapping. + +Note that there are always 10 available *user data entries* in registers - +entries beyond that limit must be fetched from memory (via the spill table +pointer) by the shader. + + .. table:: PAL Compute Shader User Data Registers + :name: pal-compute-user-data-registers + + ============= ================================ + User Register Description + ============= ================================ + 0 Global Internal Table (32-bit pointer) + 1 Per-Shader Internal Table (32-bit pointer) + 2 - 11 Application-Controlled User Data (10 32-bit values) + 12 Spill Table (32-bit pointer) + 13 - 14 Thread Group Count (64-bit pointer) + 15 GDS Range + ============= ================================ + +Graphics User Data +~~~~~~~~~~~~~~~~~~ + +Graphics pipelines support a much more flexible user data mapping: + + .. table:: PAL Graphics Shader User Data Registers + :name: pal-graphics-user-data-registers + + ============= ================================ + User Register Description + ============= ================================ + 0 Global Internal Table (32-bit pointer) + + Per-Shader Internal Table (32-bit pointer) + + 1-15 Application Controlled User Data + (1-15 Contiguous 32-bit Values in Registers) + + Spill Table (32-bit pointer) + + Draw Index (First Stage Only) + + Vertex Offset (First Stage Only) + + Instance Offset (First Stage Only) + ============= ================================ + + The placement of the global internal table remains fixed in the first *user + data SGPR register*. Otherwise all parameters are optional, and can be mapped + to any desired *user data SGPR register*, with the following restrictions: + + * Draw Index, Vertex Offset, and Instance Offset can only be used by the first + active hardware stage in a graphics pipeline (i.e. where the API vertex + shader runs). + + * Application-controlled user data must be mapped into a contiguous range of + user data registers. + + * The application-controlled user data range supports compaction remapping, so + only *entries* that are actually consumed by the shader must be assigned to + corresponding *registers*. Note that in order to support an efficient runtime + implementation, the remapping must pack *registers* in the same order as + *entries*, with unused *entries* removed. + +.. _pal_global_internal_table: + +Global Internal Table +~~~~~~~~~~~~~~~~~~~~~ + +The global internal table is a table of *shader resource descriptors* (SRDs) +that define how certain engine-wide, runtime-managed resources should be +accessed from a shader. The majority of these resources have HW-defined formats, +and it is up to the compiler to write/read data as required by the target +hardware. + +The following table illustrates the required format: + + .. table:: PAL Global Internal Table + :name: pal-git-table + + ============= ================================ + Offset Description + ============= ================================ + 0-3 Graphics Scratch SRD + 4-7 Compute Scratch SRD + 8-11 ES/GS Ring Output SRD + 12-15 ES/GS Ring Input SRD + 16-19 GS/VS Ring Output #0 + 20-23 GS/VS Ring Output #1 + 24-27 GS/VS Ring Output #2 + 28-31 GS/VS Ring Output #3 + 32-35 GS/VS Ring Input SRD + 36-39 Tessellation Factor Buffer SRD + 40-43 Off-Chip LDS Buffer SRD + 44-47 Off-Chip Param Cache Buffer SRD + 48-51 Sample Position Buffer SRD + 52 vaRange::ShadowDescriptorTable High Bits + ============= ================================ + + The pointer to the global internal table passed to the shader as user data + is a 32-bit pointer. The top 32 bits should be assumed to be the same as + the top 32 bits of the pipeline, so the shader may use the program + counter's top 32 bits. + +Unspecified OS +-------------- + +This section provides code conventions used when the target triple OS is +empty (see :ref:`amdgpu-target-triples`). + +Trap Handler ABI +~~~~~~~~~~~~~~~~ + +For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does +not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap`` +instructions are handled as follows: + + .. table:: AMDGPU Trap Handler for Non-AMDHSA OS + :name: amdgpu-trap-handler-for-non-amdhsa-os-table + + =============== =============== =========================================== + Usage Code Sequence Description + =============== =============== =========================================== + llvm.trap s_endpgm Causes wavefront to be terminated. + llvm.debugtrap *none* Compiler warning given that there is no + trap handler installed. + =============== =============== =========================================== + +Source Languages +================ + +.. _amdgpu-opencl: + +OpenCL +------ + +When the language is OpenCL the following differences occur: + +1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`). +2. The AMDGPU backend appends additional arguments to the kernel's explicit + arguments for the AMDHSA OS (see + :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`). +3. Additional metadata is generated + (see :ref:`amdgpu-amdhsa-code-object-metadata`). + + .. table:: OpenCL kernel implicit arguments appended for AMDHSA OS + :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table + + ======== ==== ========= =========================================== + Position Byte Byte Description + Size Alignment + ======== ==== ========= =========================================== + 1 8 8 OpenCL Global Offset X + 2 8 8 OpenCL Global Offset Y + 3 8 8 OpenCL Global Offset Z + 4 8 8 OpenCL address of printf buffer + 5 8 8 OpenCL address of virtual queue used by + enqueue_kernel. + 6 8 8 OpenCL address of AqlWrap struct used by + enqueue_kernel. + 7 8 8 Pointer argument used for Multi-gird + synchronization. + ======== ==== ========= =========================================== + +.. _amdgpu-hcc: + +HCC +--- + +When the language is HCC the following differences occur: + +1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`). + +.. _amdgpu-assembler: + +Assembler +--------- + +AMDGPU backend has LLVM-MC based assembler which is currently in development. +It supports AMDGCN GFX6-GFX10. + +This section describes general syntax for instructions and operands. + +Instructions +~~~~~~~~~~~~ + +An instruction has the following :doc:`syntax`: + + | ``<``\ *opcode*\ ``> <``\ *operand0*\ ``>, <``\ *operand1*\ ``>,... + <``\ *modifier0*\ ``> <``\ *modifier1*\ ``>...`` + +:doc:`Operands` are comma-separated while +:doc:`modifiers` are space-separated. + +The order of operands and modifiers is fixed. +Most modifiers are optional and may be omitted. + +Links to detailed instruction syntax description may be found in the following +table. Note that features under development are not included +in this description. + + =================================== ======================================= + Core ISA ISA Extensions + =================================== ======================================= + :doc:`GFX7` \- + :doc:`GFX8` \- + :doc:`GFX9` :doc:`gfx900` + + :doc:`gfx902` + + :doc:`gfx904` + + :doc:`gfx906` + + :doc:`gfx908` + + :doc:`gfx909` + + :doc:`GFX10` :doc:`gfx1011` + + :doc:`gfx1012` + =================================== ======================================= + +For more information about instructions, their semantics and supported +combinations of operands, refer to one of instruction set architecture manuals +[AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_, [AMD-GCN-GFX9]_ and +[AMD-GCN-GFX10]_. + +Operands +~~~~~~~~ + +Detailed description of operands may be found :doc:`here`. + +Modifiers +~~~~~~~~~ + +Detailed description of modifiers may be found +:doc:`here`. + +Instruction Examples +~~~~~~~~~~~~~~~~~~~~ + +DS +++ + +.. code-block:: nasm + + ds_add_u32 v2, v4 offset:16 + ds_write_src2_b64 v2 offset0:4 offset1:8 + ds_cmpst_f32 v2, v4, v6 + ds_min_rtn_f64 v[8:9], v2, v[4:5] + +For full list of supported instructions, refer to "LDS/GDS instructions" in ISA +Manual. + +FLAT +++++ + +.. code-block:: nasm + + flat_load_dword v1, v[3:4] + flat_store_dwordx3 v[3:4], v[5:7] + flat_atomic_swap v1, v[3:4], v5 glc + flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc + flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc + +For full list of supported instructions, refer to "FLAT instructions" in ISA +Manual. + +MUBUF ++++++ + +.. code-block:: nasm + + buffer_load_dword v1, off, s[4:7], s1 + buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe + buffer_store_format_xy v[1:2], off, s[4:7], s1 + buffer_wbinvl1 + buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc + +For full list of supported instructions, refer to "MUBUF Instructions" in ISA +Manual. + +SMRD/SMEM ++++++++++ + +.. code-block:: nasm + + s_load_dword s1, s[2:3], 0xfc + s_load_dwordx8 s[8:15], s[2:3], s4 + s_load_dwordx16 s[88:103], s[2:3], s4 + s_dcache_inv_vol + s_memtime s[4:5] + +For full list of supported instructions, refer to "Scalar Memory Operations" in +ISA Manual. + +SOP1 +++++ + +.. code-block:: nasm + + s_mov_b32 s1, s2 + s_mov_b64 s[0:1], 0x80000000 + s_cmov_b32 s1, 200 + s_wqm_b64 s[2:3], s[4:5] + s_bcnt0_i32_b64 s1, s[2:3] + s_swappc_b64 s[2:3], s[4:5] + s_cbranch_join s[4:5] + +For full list of supported instructions, refer to "SOP1 Instructions" in ISA +Manual. + +SOP2 +++++ + +.. code-block:: nasm + + s_add_u32 s1, s2, s3 + s_and_b64 s[2:3], s[4:5], s[6:7] + s_cselect_b32 s1, s2, s3 + s_andn2_b32 s2, s4, s6 + s_lshr_b64 s[2:3], s[4:5], s6 + s_ashr_i32 s2, s4, s6 + s_bfm_b64 s[2:3], s4, s6 + s_bfe_i64 s[2:3], s[4:5], s6 + s_cbranch_g_fork s[4:5], s[6:7] + +For full list of supported instructions, refer to "SOP2 Instructions" in ISA +Manual. + +SOPC +++++ + +.. code-block:: nasm + + s_cmp_eq_i32 s1, s2 + s_bitcmp1_b32 s1, s2 + s_bitcmp0_b64 s[2:3], s4 + s_setvskip s3, s5 + +For full list of supported instructions, refer to "SOPC Instructions" in ISA +Manual. + +SOPP +++++ + +.. code-block:: nasm + + s_barrier + s_nop 2 + s_endpgm + s_waitcnt 0 ; Wait for all counters to be 0 + s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above + s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1. + s_sethalt 9 + s_sleep 10 + s_sendmsg 0x1 + s_sendmsg sendmsg(MSG_INTERRUPT) + s_trap 1 + +For full list of supported instructions, refer to "SOPP Instructions" in ISA +Manual. + +Unless otherwise mentioned, little verification is performed on the operands +of SOPP Instructions, so it is up to the programmer to be familiar with the +range or acceptable values. + +VALU +++++ + +For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA), +the assembler will automatically use optimal encoding based on its operands. To +force specific encoding, one can add a suffix to the opcode of the instruction: + +* _e32 for 32-bit VOP1/VOP2/VOPC +* _e64 for 64-bit VOP3 +* _dpp for VOP_DPP +* _sdwa for VOP_SDWA + +VOP1/VOP2/VOP3/VOPC examples: + +.. code-block:: nasm + + v_mov_b32 v1, v2 + v_mov_b32_e32 v1, v2 + v_nop + v_cvt_f64_i32_e32 v[1:2], v2 + v_floor_f32_e32 v1, v2 + v_bfrev_b32_e32 v1, v2 + v_add_f32_e32 v1, v2, v3 + v_mul_i32_i24_e64 v1, v2, 3 + v_mul_i32_i24_e32 v1, -3, v3 + v_mul_i32_i24_e32 v1, -100, v3 + v_addc_u32 v1, s[0:1], v2, v3, s[2:3] + v_max_f16_e32 v1, v2, v3 + +VOP_DPP examples: + +.. code-block:: nasm + + v_mov_b32 v0, v0 quad_perm:[0,2,1,1] + v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0 + v_mov_b32 v0, v0 wave_shl:1 + v_mov_b32 v0, v0 row_mirror + v_mov_b32 v0, v0 row_bcast:31 + v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0 + v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0 + v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0 + +VOP_SDWA examples: + +.. code-block:: nasm + + v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD + v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD + v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1 + v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1 + v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0 + +For full list of supported instructions, refer to "Vector ALU instructions". + +.. TODO:: + + Remove once we switch to code object v3 by default. + +.. _amdgpu-amdhsa-assembler-predefined-symbols-v2: + +Code Object V2 Predefined Symbols (-mattr=-code-object-v3) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. warning:: Code Object V2 is not the default code object version emitted by + this version of LLVM. For a description of the predefined symbols available + with the default configuration (Code Object V3) see + :ref:`amdgpu-amdhsa-assembler-predefined-symbols-v3`. + +The AMDGPU assembler defines and updates some symbols automatically. These +symbols do not affect code generation. + +.option.machine_version_major ++++++++++++++++++++++++++++++ + +Set to the GFX major generation number of the target being assembled for. For +example, when assembling for a "GFX9" target this will be set to the integer +value "9". The possible GFX major generation numbers are presented in +:ref:`amdgpu-processors`. + +.option.machine_version_minor ++++++++++++++++++++++++++++++ + +Set to the GFX minor generation number of the target being assembled for. For +example, when assembling for a "GFX810" target this will be set to the integer +value "1". The possible GFX minor generation numbers are presented in +:ref:`amdgpu-processors`. + +.option.machine_version_stepping +++++++++++++++++++++++++++++++++ + +Set to the GFX stepping generation number of the target being assembled for. +For example, when assembling for a "GFX704" target this will be set to the +integer value "4". The possible GFX stepping generation numbers are presented +in :ref:`amdgpu-processors`. + +.kernel.vgpr_count +++++++++++++++++++ + +Set to zero each time a +:ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is +encountered. At each instruction, if the current value of this symbol is less +than or equal to the maximum VPGR number explicitly referenced within that +instruction then the symbol value is updated to equal that VGPR number plus +one. + +.kernel.sgpr_count +++++++++++++++++++ + +Set to zero each time a +:ref:`amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel` directive is +encountered. At each instruction, if the current value of this symbol is less +than or equal to the maximum VPGR number explicitly referenced within that +instruction then the symbol value is updated to equal that SGPR number plus +one. + +.. _amdgpu-amdhsa-assembler-directives-v2: + +Code Object V2 Directives (-mattr=-code-object-v3) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. warning:: Code Object V2 is not the default code object version emitted by + this version of LLVM. For a description of the directives supported with + the default configuration (Code Object V3) see + :ref:`amdgpu-amdhsa-assembler-directives-v3`. + +AMDGPU ABI defines auxiliary data in output code object. In assembly source, +one can specify them with assembler directives. + +.hsa_code_object_version major, minor ++++++++++++++++++++++++++++++++++++++ + +*major* and *minor* are integers that specify the version of the HSA code +object that will be generated by the assembler. + +.hsa_code_object_isa [major, minor, stepping, vendor, arch] ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ + + +*major*, *minor*, and *stepping* are all integers that describe the instruction +set architecture (ISA) version of the assembly program. + +*vendor* and *arch* are quoted strings. *vendor* should always be equal to +"AMD" and *arch* should always be equal to "AMDGPU". + +By default, the assembler will derive the ISA version, *vendor*, and *arch* +from the value of the -mcpu option that is passed to the assembler. + +.. _amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel: + +.amdgpu_hsa_kernel (name) ++++++++++++++++++++++++++ + +This directives specifies that the symbol with given name is a kernel entry +point (label) and the object should contain corresponding symbol of type +STT_AMDGPU_HSA_KERNEL. + +.amd_kernel_code_t +++++++++++++++++++ + +This directive marks the beginning of a list of key / value pairs that are used +to specify the amd_kernel_code_t object that will be emitted by the assembler. +The list must be terminated by the *.end_amd_kernel_code_t* directive. For any +amd_kernel_code_t values that are unspecified a default value will be used. The +default value for all keys is 0, with the following exceptions: + +- *amd_code_version_major* defaults to 1. +- *amd_kernel_code_version_minor* defaults to 2. +- *amd_machine_kind* defaults to 1. +- *amd_machine_version_major*, *machine_version_minor*, and + *amd_machine_version_stepping* are derived from the value of the -mcpu option + that is passed to the assembler. +- *kernel_code_entry_byte_offset* defaults to 256. +- *wavefront_size* defaults 6 for all targets before GFX10. For GFX10 onwards + defaults to 6 if target feature ``wavefrontsize64`` is enabled, otherwise 5. + Note that wavefront size is specified as a power of two, so a value of **n** + means a size of 2^ **n**. +- *call_convention* defaults to -1. +- *kernarg_segment_alignment*, *group_segment_alignment*, and + *private_segment_alignment* default to 4. Note that alignments are specified + as a power of 2, so a value of **n** means an alignment of 2^ **n**. +- *enable_wgp_mode* defaults to 1 if target feature ``cumode`` is disabled for + GFX10 onwards. +- *enable_mem_ordered* defaults to 1 for GFX10 onwards. + +The *.amd_kernel_code_t* directive must be placed immediately after the +function label and before any instructions. + +For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document, +comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s. + +.. _amdgpu-amdhsa-assembler-example-v2: + +Code Object V2 Example Source Code (-mattr=-code-object-v3) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. warning:: Code Object V2 is not the default code object version emitted by + this version of LLVM. For a description of the directives supported with + the default configuration (Code Object V3) see + :ref:`amdgpu-amdhsa-assembler-example-v3`. + +Here is an example of a minimal assembly source file, defining one HSA kernel: + +.. code:: + :number-lines: + + .hsa_code_object_version 1,0 + .hsa_code_object_isa + + .hsatext + .globl hello_world + .p2align 8 + .amdgpu_hsa_kernel hello_world + + hello_world: + + .amd_kernel_code_t + enable_sgpr_kernarg_segment_ptr = 1 + is_ptr64 = 1 + compute_pgm_rsrc1_vgprs = 0 + compute_pgm_rsrc1_sgprs = 0 + compute_pgm_rsrc2_user_sgpr = 2 + compute_pgm_rsrc1_wgp_mode = 0 + compute_pgm_rsrc1_mem_ordered = 0 + compute_pgm_rsrc1_fwd_progress = 1 + .end_amd_kernel_code_t + + s_load_dwordx2 s[0:1], s[0:1] 0x0 + v_mov_b32 v0, 3.14159 + s_waitcnt lgkmcnt(0) + v_mov_b32 v1, s0 + v_mov_b32 v2, s1 + flat_store_dword v[1:2], v0 + s_endpgm + .Lfunc_end0: + .size hello_world, .Lfunc_end0-hello_world + +.. _amdgpu-amdhsa-assembler-predefined-symbols-v3: + +Code Object V3 Predefined Symbols (-mattr=+code-object-v3) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The AMDGPU assembler defines and updates some symbols automatically. These +symbols do not affect code generation. + +.amdgcn.gfx_generation_number ++++++++++++++++++++++++++++++ + +Set to the GFX major generation number of the target being assembled for. For +example, when assembling for a "GFX9" target this will be set to the integer +value "9". The possible GFX major generation numbers are presented in +:ref:`amdgpu-processors`. + +.amdgcn.gfx_generation_minor +++++++++++++++++++++++++++++ + +Set to the GFX minor generation number of the target being assembled for. For +example, when assembling for a "GFX810" target this will be set to the integer +value "1". The possible GFX minor generation numbers are presented in +:ref:`amdgpu-processors`. + +.amdgcn.gfx_generation_stepping ++++++++++++++++++++++++++++++++ + +Set to the GFX stepping generation number of the target being assembled for. +For example, when assembling for a "GFX704" target this will be set to the +integer value "4". The possible GFX stepping generation numbers are presented +in :ref:`amdgpu-processors`. + +.. _amdgpu-amdhsa-assembler-symbol-next_free_vgpr: + +.amdgcn.next_free_vgpr +++++++++++++++++++++++ + +Set to zero before assembly begins. At each instruction, if the current value +of this symbol is less than or equal to the maximum VGPR number explicitly +referenced within that instruction then the symbol value is updated to equal +that VGPR number plus one. + +May be used to set the `.amdhsa_next_free_vpgr` directive in +:ref:`amdhsa-kernel-directives-table`. + +May be set at any time, e.g. manually set to zero at the start of each kernel. + +.. _amdgpu-amdhsa-assembler-symbol-next_free_sgpr: + +.amdgcn.next_free_sgpr +++++++++++++++++++++++ + +Set to zero before assembly begins. At each instruction, if the current value +of this symbol is less than or equal the maximum SGPR number explicitly +referenced within that instruction then the symbol value is updated to equal +that SGPR number plus one. + +May be used to set the `.amdhsa_next_free_spgr` directive in +:ref:`amdhsa-kernel-directives-table`. + +May be set at any time, e.g. manually set to zero at the start of each kernel. + +.. _amdgpu-amdhsa-assembler-directives-v3: + +Code Object V3 Directives (-mattr=+code-object-v3) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Directives which begin with ``.amdgcn`` are valid for all ``amdgcn`` +architecture processors, and are not OS-specific. Directives which begin with +``.amdhsa`` are specific to ``amdgcn`` architecture processors when the +``amdhsa`` OS is specified. See :ref:`amdgpu-target-triples` and +:ref:`amdgpu-processors`. + +.amdgcn_target ++++++++++++++++++++++++ + +Optional directive which declares the target supported by the containing +assembler source file. Valid values are described in +:ref:`amdgpu-amdhsa-code-object-target-identification`. Used by the assembler +to validate command-line options such as ``-triple``, ``-mcpu``, and those +which specify target features. + +.amdhsa_kernel ++++++++++++++++++++++ + +Creates a correctly aligned AMDHSA kernel descriptor and a symbol, +``.kd``, in the current location of the current section. Only valid when +the OS is ``amdhsa``. ```` must be a symbol that labels the first +instruction to execute, and does not need to be previously defined. + +Marks the beginning of a list of directives used to generate the bytes of a +kernel descriptor, as described in :ref:`amdgpu-amdhsa-kernel-descriptor`. +Directives which may appear in this list are described in +:ref:`amdhsa-kernel-directives-table`. Directives may appear in any order, must +be valid for the target being assembled for, and cannot be repeated. Directives +support the range of values specified by the field they reference in +:ref:`amdgpu-amdhsa-kernel-descriptor`. If a directive is not specified, it is +assumed to have its default value, unless it is marked as "Required", in which +case it is an error to omit the directive. This list of directives is +terminated by an ``.end_amdhsa_kernel`` directive. + + .. table:: AMDHSA Kernel Assembler Directives + :name: amdhsa-kernel-directives-table + + ======================================================== =================== ============ =================== + Directive Default Supported On Description + ======================================================== =================== ============ =================== + ``.amdhsa_group_segment_fixed_size`` 0 GFX6-GFX10 Controls GROUP_SEGMENT_FIXED_SIZE in + :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. + ``.amdhsa_private_segment_fixed_size`` 0 GFX6-GFX10 Controls PRIVATE_SEGMENT_FIXED_SIZE in + :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. + ``.amdhsa_user_sgpr_private_segment_buffer`` 0 GFX6-GFX10 Controls ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER in + :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. + ``.amdhsa_user_sgpr_dispatch_ptr`` 0 GFX6-GFX10 Controls ENABLE_SGPR_DISPATCH_PTR in + :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. + ``.amdhsa_user_sgpr_queue_ptr`` 0 GFX6-GFX10 Controls ENABLE_SGPR_QUEUE_PTR in + :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. + ``.amdhsa_user_sgpr_kernarg_segment_ptr`` 0 GFX6-GFX10 Controls ENABLE_SGPR_KERNARG_SEGMENT_PTR in + :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. + ``.amdhsa_user_sgpr_dispatch_id`` 0 GFX6-GFX10 Controls ENABLE_SGPR_DISPATCH_ID in + :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. + ``.amdhsa_user_sgpr_flat_scratch_init`` 0 GFX6-GFX10 Controls ENABLE_SGPR_FLAT_SCRATCH_INIT in + :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. + ``.amdhsa_user_sgpr_private_segment_size`` 0 GFX6-GFX10 Controls ENABLE_SGPR_PRIVATE_SEGMENT_SIZE in + :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. + ``.amdhsa_wavefront_size32`` Target GFX10 Controls ENABLE_WAVEFRONT_SIZE32 in + Feature :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. + Specific + (-wavefrontsize64) + ``.amdhsa_system_sgpr_private_segment_wavefront_offset`` 0 GFX6-GFX10 Controls ENABLE_SGPR_PRIVATE_SEGMENT_WAVEFRONT_OFFSET in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. + ``.amdhsa_system_sgpr_workgroup_id_x`` 1 GFX6-GFX10 Controls ENABLE_SGPR_WORKGROUP_ID_X in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. + ``.amdhsa_system_sgpr_workgroup_id_y`` 0 GFX6-GFX10 Controls ENABLE_SGPR_WORKGROUP_ID_Y in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. + ``.amdhsa_system_sgpr_workgroup_id_z`` 0 GFX6-GFX10 Controls ENABLE_SGPR_WORKGROUP_ID_Z in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. + ``.amdhsa_system_sgpr_workgroup_info`` 0 GFX6-GFX10 Controls ENABLE_SGPR_WORKGROUP_INFO in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. + ``.amdhsa_system_vgpr_workitem_id`` 0 GFX6-GFX10 Controls ENABLE_VGPR_WORKITEM_ID in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. + Possible values are defined in + :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`. + ``.amdhsa_next_free_vgpr`` Required GFX6-GFX10 Maximum VGPR number explicitly referenced, plus one. + Used to calculate GRANULATED_WORKITEM_VGPR_COUNT in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. + ``.amdhsa_next_free_sgpr`` Required GFX6-GFX10 Maximum SGPR number explicitly referenced, plus one. + Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. + ``.amdhsa_reserve_vcc`` 1 GFX6-GFX10 Whether the kernel may use the special VCC SGPR. + Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. + ``.amdhsa_reserve_flat_scratch`` 1 GFX7-GFX10 Whether the kernel may use flat instructions to access + scratch memory. Used to calculate + GRANULATED_WAVEFRONT_SGPR_COUNT in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. + ``.amdhsa_reserve_xnack_mask`` Target GFX8-GFX10 Whether the kernel may trigger XNACK replay. + Feature Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in + Specific :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. + (+xnack) + ``.amdhsa_float_round_mode_32`` 0 GFX6-GFX10 Controls FLOAT_ROUND_MODE_32 in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. + Possible values are defined in + :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`. + ``.amdhsa_float_round_mode_16_64`` 0 GFX6-GFX10 Controls FLOAT_ROUND_MODE_16_64 in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. + Possible values are defined in + :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`. + ``.amdhsa_float_denorm_mode_32`` 0 GFX6-GFX10 Controls FLOAT_DENORM_MODE_32 in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. + Possible values are defined in + :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`. + ``.amdhsa_float_denorm_mode_16_64`` 3 GFX6-GFX10 Controls FLOAT_DENORM_MODE_16_64 in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. + Possible values are defined in + :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`. + ``.amdhsa_dx10_clamp`` 1 GFX6-GFX10 Controls ENABLE_DX10_CLAMP in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. + ``.amdhsa_ieee_mode`` 1 GFX6-GFX10 Controls ENABLE_IEEE_MODE in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. + ``.amdhsa_fp16_overflow`` 0 GFX9-GFX10 Controls FP16_OVFL in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. + ``.amdhsa_workgroup_processor_mode`` Target GFX10 Controls ENABLE_WGP_MODE in + Feature :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx10-table`. + Specific + (-cumode) + ``.amdhsa_memory_ordered`` 1 GFX10 Controls MEM_ORDERED in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. + ``.amdhsa_forward_progress`` 0 GFX10 Controls FWD_PROGRESS in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx10-table`. + ``.amdhsa_exception_fp_ieee_invalid_op`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. + ``.amdhsa_exception_fp_denorm_src`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_FP_DENORMAL_SOURCE in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. + ``.amdhsa_exception_fp_ieee_div_zero`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. + ``.amdhsa_exception_fp_ieee_overflow`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. + ``.amdhsa_exception_fp_ieee_underflow`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. + ``.amdhsa_exception_fp_ieee_inexact`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_IEEE_754_FP_INEXACT in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. + ``.amdhsa_exception_int_div_zero`` 0 GFX6-GFX10 Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in + :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx10-table`. + ======================================================== =================== ============ =================== + +.amdgpu_metadata +++++++++++++++++ + +Optional directive which declares the contents of the ``NT_AMDGPU_METADATA`` +note record (see :ref:`amdgpu-elf-note-records-table-v3`). + +The contents must be in the [YAML]_ markup format, with the same structure and +semantics described in :ref:`amdgpu-amdhsa-code-object-metadata-v3`. + +This directive is terminated by an ``.end_amdgpu_metadata`` directive. + +.. _amdgpu-amdhsa-assembler-example-v3: + +Code Object V3 Example Source Code (-mattr=+code-object-v3) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Here is an example of a minimal assembly source file, defining one HSA kernel: + +.. code:: + :number-lines: + + .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional + + .text + .globl hello_world + .p2align 8 + .type hello_world,@function + hello_world: + s_load_dwordx2 s[0:1], s[0:1] 0x0 + v_mov_b32 v0, 3.14159 + s_waitcnt lgkmcnt(0) + v_mov_b32 v1, s0 + v_mov_b32 v2, s1 + flat_store_dword v[1:2], v0 + s_endpgm + .Lfunc_end0: + .size hello_world, .Lfunc_end0-hello_world + + .rodata + .p2align 6 + .amdhsa_kernel hello_world + .amdhsa_user_sgpr_kernarg_segment_ptr 1 + .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr + .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr + .end_amdhsa_kernel + + .amdgpu_metadata + --- + amdhsa.version: + - 1 + - 0 + amdhsa.kernels: + - .name: hello_world + .symbol: hello_world.kd + .kernarg_segment_size: 48 + .group_segment_fixed_size: 0 + .private_segment_fixed_size: 0 + .kernarg_segment_align: 4 + .wavefront_size: 64 + .sgpr_count: 2 + .vgpr_count: 3 + .max_flat_workgroup_size: 256 + ... + .end_amdgpu_metadata + +If an assembly source file contains multiple kernels and/or functions, the +:ref:`amdgpu-amdhsa-assembler-symbol-next_free_vgpr` and +:ref:`amdgpu-amdhsa-assembler-symbol-next_free_sgpr` symbols may be reset using +the ``.set , `` directive. For example, in the case of two +kernels, where ``function1`` is only called from ``kernel1`` it is sufficient +to group the function with the kernel that calls it and reset the symbols +between the two connected components: + +.. code:: + :number-lines: + + .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional + + // gpr tracking symbols are implicitly set to zero + + .text + .globl kern0 + .p2align 8 + .type kern0,@function + kern0: + // ... + s_endpgm + .Lkern0_end: + .size kern0, .Lkern0_end-kern0 + + .rodata + .p2align 6 + .amdhsa_kernel kern0 + // ... + .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr + .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr + .end_amdhsa_kernel + + // reset symbols to begin tracking usage in func1 and kern1 + .set .amdgcn.next_free_vgpr, 0 + .set .amdgcn.next_free_sgpr, 0 + + .text + .hidden func1 + .global func1 + .p2align 2 + .type func1,@function + func1: + // ... + s_setpc_b64 s[30:31] + .Lfunc1_end: + .size func1, .Lfunc1_end-func1 + + .globl kern1 + .p2align 8 + .type kern1,@function + kern1: + // ... + s_getpc_b64 s[4:5] + s_add_u32 s4, s4, func1@rel32@lo+4 + s_addc_u32 s5, s5, func1@rel32@lo+4 + s_swappc_b64 s[30:31], s[4:5] + // ... + s_endpgm + .Lkern1_end: + .size kern1, .Lkern1_end-kern1 + + .rodata + .p2align 6 + .amdhsa_kernel kern1 + // ... + .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr + .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr + .end_amdhsa_kernel + +These symbols cannot identify connected components in order to automatically +track the usage for each kernel. However, in some cases careful organization of +the kernels and functions in the source file means there is minimal additional +effort required to accurately calculate GPR usage. + +Additional Documentation +======================== + +.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA `__ +.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA `_ +.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture `__ +.. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture `__ +.. [AMD-GCN-GFX10] `AMD "RDNA 1.0" Instruction Set Architecture `__ +.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA `__ +.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA `__ +.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA `__ +.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA `__ +.. [AMD-ROCm] `AMD ROCm Platform `__ +.. [AMD-ROCm-github] `ROCm github `__ +.. [CLANG-ATTR] `Attributes in Clang `__ +.. [DWARF] `DWARF Debugging Information Format `__ +.. [ELF] `Executable and Linkable Format (ELF) `__ +.. [HRF] `Heterogeneous-race-free Memory Models `__ +.. [HSA] `Heterogeneous System Architecture (HSA) Foundation `__ +.. [MsgPack] `Message Pack `__ +.. [OpenCL] `The OpenCL Specification Version 2.0 `__ +.. [SEMVER] `Semantic Versioning `__ +.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 `__