
User Guide for AMDGPU Back-end


Introduction

The AMDGPU back-end provides ISA code generation for AMD GPUs, from the R600 family through the current Volcanic Islands (GCN Gen 3).

The following additional documentation is available:

  • AMD GCN3 Instruction Set Architecture: PDF
  • Southern Islands Series Instruction Set Architecture: PDF
  • R600-Family Instruction Set Architecture: PDF
  • AMDGPU Compute Application Binary Interface: MD

Conventions


Address Spaces


The AMDGPU back-end uses the following address space mapping:

Address Space    Memory Space
0                Private
1                Global
2                Constant
3                Local
4                Generic (Flat)
5                Region

The terminology in the table, aside from the region memory space, is from the OpenCL standard.
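
As a quick reference, the mapping in the table can be written as a lookup. The Python sketch below is purely illustrative: the names ADDRESS_SPACES and memory_space are hypothetical and are not part of LLVM.

```python
# Address space numbers from the table, keyed to their memory space.
# ADDRESS_SPACES and memory_space are illustrative names, not LLVM APIs.
ADDRESS_SPACES = {
    0: "Private",
    1: "Global",
    2: "Constant",
    3: "Local",
    4: "Generic (Flat)",
    5: "Region",
}


def memory_space(address_space: int) -> str:
    """Return the memory space name for an AMDGPU address space number."""
    try:
        return ADDRESS_SPACES[address_space]
    except KeyError:
        raise ValueError(f"unknown address space: {address_space}") from None
```

For example, memory_space(3) returns "Local", and an unlisted number raises ValueError.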


Assembler

The AMDGPU back-end has an LLVM-MC based assembler, which is currently in development. It supports the Southern Islands, Sea Islands, and Volcanic Islands ISAs.

This document describes the general syntax for instructions and operands. For more information about instructions, their semantics, and supported combinations of operands, refer to one of the Instruction Set Architecture manuals.

An instruction has the following syntax (register operands are normally comma-separated, while extra operands are space-separated):

<opcode> <register_operand0>, ... <extra_operand0> ...


Operands


The following syntax for register operands is supported:

  • SGPR registers: s0, ... or s[0], ...
  • VGPR registers: v0, ... or v[0], ...
  • TTMP registers: ttmp0, ... or ttmp[0], ...
  • Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
  • Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
  • Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
  • Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
  • Register index expressions: v[2*2], s[1-1:2-1]
  • off indicates that an operand is not enabled
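
To illustrate how the simplest of these forms decompose, the Python sketch below splits a single SGPR, VGPR, or TTMP operand into its register kind and index range. It is a reading aid under stated assumptions, not the assembler's parser: parse_register is a hypothetical name, and special registers, register lists, index expressions, and off are deliberately not handled.

```python
import re

# Matches only the simple forms "s0", "v[0]", and "ttmp[4:7]".
# Special registers, register lists, and index expressions such as v[2*2]
# are out of scope for this sketch.
_REGISTER = re.compile(
    r"^(?P<kind>ttmp|s|v)"
    r"(?:(?P<single>\d+)|\[(?P<first>\d+)(?::(?P<last>\d+))?\])$"
)


def parse_register(operand: str):
    """Return (kind, first_index, last_index) for a register operand."""
    match = _REGISTER.match(operand.strip())
    if match is None:
        raise ValueError(f"unsupported register operand: {operand!r}")
    if match.group("single") is not None:
        first = last = int(match.group("single"))
    else:
        first = int(match.group("first"))
        # A bracketed single index such as v[0] has no ":last" part.
        last = int(match.group("last")) if match.group("last") else first
    return match.group("kind"), first, last
```

For example, parse_register("s[4:7]") returns ("s", 4, 7), and parse_register("v0") returns ("v", 0, 0).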

The following extra operands are supported:

  • offset, offset0, offset1
  • idxen, offen bits
  • glc, slc, tfe bits
  • waitcnt: integer or combination of counter values
  • VOP3 modifiers:
    • abs (| |), neg (-)
  • DPP modifiers:
    • row_shl, row_shr, row_ror, row_rol
    • row_mirror, row_half_mirror, row_bcast
    • wave_shl, wave_shr, wave_ror, wave_rol, quad_perm
    • row_mask, bank_mask, bound_ctrl
  • SDWA modifiers:
    • dst_sel, src0_sel, src1_sel (BYTE_N, WORD_M, DWORD)
    • dst_unused (UNUSED_PAD, UNUSED_SEXT, UNUSED_PRESERVE)
    • abs, neg, sext

DS Instruction Examples

ds_add_u32 v2, v4 offset:16
ds_write_src2_b64 v2 offset0:4 offset1:8
ds_cmpst_f32 v2, v4, v6
ds_min_rtn_f64 v[8:9], v2, v[4:5]

For the full list of supported instructions, refer to “LDS/GDS instructions” in the ISA Manual.

FLAT Instruction Examples

flat_load_dword v1, v[3:4]
flat_store_dwordx3 v[3:4], v[5:7]
flat_atomic_swap v1, v[3:4], v5 glc
flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc

For the full list of supported instructions, refer to “FLAT instructions” in the ISA Manual.

MUBUF Instruction Examples

buffer_load_dword v1, off, s[4:7], s1
buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
buffer_store_format_xy v[1:2], off, s[4:7], s1
buffer_wbinvl1
buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc

For the full list of supported instructions, refer to “MUBUF Instructions” in the ISA Manual.

SMRD/SMEM Instruction Examples

s_load_dword s1, s[2:3], 0xfc
s_load_dwordx8 s[8:15], s[2:3], s4
s_load_dwordx16 s[88:103], s[2:3], s4
s_dcache_inv_vol
s_memtime s[4:5]

For the full list of supported instructions, refer to “Scalar Memory Operations” in the ISA Manual.

SOP1 Instruction Examples

s_mov_b32 s1, s2
s_mov_b64 s[0:1], 0x80000000
s_cmov_b32 s1, 200
s_wqm_b64 s[2:3], s[4:5]
s_bcnt0_i32_b64 s1, s[2:3]
s_swappc_b64 s[2:3], s[4:5]
s_cbranch_join s[4:5]

For the full list of supported instructions, refer to “SOP1 Instructions” in the ISA Manual.

SOP2 Instruction Examples

s_add_u32 s1, s2, s3
s_and_b64 s[2:3], s[4:5], s[6:7]
s_cselect_b32 s1, s2, s3
s_andn2_b32 s2, s4, s6
s_lshr_b64 s[2:3], s[4:5], s6
s_ashr_i32 s2, s4, s6
s_bfm_b64 s[2:3], s4, s6
s_bfe_i64 s[2:3], s[4:5], s6
s_cbranch_g_fork s[4:5], s[6:7]

For the full list of supported instructions, refer to “SOP2 Instructions” in the ISA Manual.

SOPC Instruction Examples

s_cmp_eq_i32 s1, s2
s_bitcmp1_b32 s1, s2
s_bitcmp0_b64 s[2:3], s4
s_setvskip s3, s5

For the full list of supported instructions, refer to “SOPC Instructions” in the ISA Manual.

SOPP Instruction Examples

s_barrier
s_nop 2
s_endpgm
s_waitcnt 0 ; Wait for all counters to be 0
s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
s_sethalt 9
s_sleep 10
s_sendmsg 0x1
s_sendmsg sendmsg(MSG_INTERRUPT)
s_trap 1
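
To make the relationship between the integer and counter forms of s_waitcnt concrete, the Python sketch below packs counter values into a single immediate. The bit layout it uses (vmcnt in bits 3:0, expcnt in bits 6:4, lgkmcnt in bits 11:8) and the rule that an omitted counter is set to its maximum value are assumptions for Southern Islands through Volcanic Islands targets; neither is stated in this guide, so check the ISA manual for your target. The name encode_waitcnt is hypothetical.

```python
# Assumed (shift, width) of each counter field in the s_waitcnt immediate.
# This layout is not taken from this guide; verify it in the ISA manual.
_FIELDS = {"vmcnt": (0, 4), "expcnt": (4, 3), "lgkmcnt": (8, 4)}


def encode_waitcnt(**counters: int) -> int:
    """Pack counter values into one immediate.

    Counters that are not given are set to their maximum value, which is
    assumed to mean "do not wait on this counter".
    """
    imm = 0
    for name, (shift, width) in _FIELDS.items():
        max_value = (1 << width) - 1
        value = counters.pop(name, max_value)
        if not 0 <= value <= max_value:
            raise ValueError(f"{name} out of range: {value}")
        imm |= value << shift
    if counters:
        raise ValueError(f"unknown counters: {sorted(counters)}")
    return imm
```

Under these assumptions, encode_waitcnt(vmcnt=0, expcnt=0, lgkmcnt=0) returns 0, which matches the equivalence between the first two s_waitcnt examples.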

For the full list of supported instructions, refer to “SOPP Instructions” in the ISA Manual.

Unless otherwise mentioned, little verification is performed on the operands of SOPP instructions, so it is up to the programmer to be familiar with the range of acceptable values.

Vector ALU Instruction Examples

For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA), the assembler automatically uses the optimal encoding based on the operands. To force a specific encoding, add a suffix to the opcode of the instruction:

  • _e32 for 32-bit VOP1/VOP2/VOPC
  • _e64 for 64-bit VOP3
  • _dpp for VOP_DPP
  • _sdwa for VOP_SDWA
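
The suffix convention can be summarized with a small lookup. The Python sketch below separates an opcode from its encoding suffix, if it has one; it only mirrors the list above and does not model how the assembler selects an encoding. The name split_encoding_suffix is hypothetical.

```python
# Encoding suffixes from the list above, mapped to the encodings they force.
_SUFFIXES = {
    "_e32": "VOP1/VOP2/VOPC",
    "_e64": "VOP3",
    "_dpp": "VOP_DPP",
    "_sdwa": "VOP_SDWA",
}


def split_encoding_suffix(opcode: str):
    """Return (base_opcode, forced_encoding), or (opcode, None) if no suffix."""
    for suffix, encoding in _SUFFIXES.items():
        if opcode.endswith(suffix):
            return opcode[: -len(suffix)], encoding
    return opcode, None
```

For example, split_encoding_suffix("v_mov_b32_e32") returns ("v_mov_b32", "VOP1/VOP2/VOPC"), while split_encoding_suffix("v_mov_b32") returns ("v_mov_b32", None), leaving the choice of encoding to the assembler.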

VOP1/VOP2/VOP3/VOPC examples:

v_mov_b32 v1, v2
v_mov_b32_e32 v1, v2
v_nop
v_cvt_f64_i32_e32 v[1:2], v2
v_floor_f32_e32 v1, v2
v_bfrev_b32_e32 v1, v2
v_add_f32_e32 v1, v2, v3
v_mul_i32_i24_e64 v1, v2, 3
v_mul_i32_i24_e32 v1, -3, v3
v_mul_i32_i24_e32 v1, -100, v3
v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
v_max_f16_e32 v1, v2, v3

VOP_DPP examples:

v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
v_mov_b32 v0, v0 wave_shl:1
v_mov_b32 v0, v0 row_mirror
v_mov_b32 v0, v0 row_bcast:31
v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0

VOP_SDWA examples:

v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0

For the full list of supported instructions, refer to “Vector ALU instructions” in the ISA Manual.

HSA Code Object Directives

The AMDGPU ABI defines auxiliary data in the output code object. In assembly source, this data can be specified with assembler directives.

.hsa_code_object_version major, minor

major and minor are integers that specify the version of the HSA code object that will be generated by the assembler.

.hsa_code_object_isa [major, minor, stepping, vendor, arch]

major, minor, and stepping are all integers that describe the instruction set architecture (ISA) version of the assembly program.

vendor and arch are quoted strings. vendor should always be equal to “AMD” and arch should always be equal to “AMDGPU”.

By default, the assembler will derive the ISA version, vendor, and arch from the value of the -mcpu option that is passed to the assembler.

.amdgpu_hsa_kernel (name)

This directive specifies that the symbol with the given name is a kernel entry point (label) and that the object should contain a corresponding symbol of type STT_AMDGPU_HSA_KERNEL.

.amd_kernel_code_t

This directive marks the beginning of a list of key / value pairs that are used to specify the amd_kernel_code_t object that will be emitted by the assembler. The list must be terminated by the .end_amd_kernel_code_t directive. Any amd_kernel_code_t value that is left unspecified takes a default value. The default value for all keys is 0, with the following exceptions:

  • kernel_code_version_major defaults to 1.
  • machine_kind defaults to 1.
  • machine_version_major, machine_version_minor, and machine_version_stepping are derived from the value of the -mcpu option that is passed to the assembler.
  • kernel_code_entry_byte_offset defaults to 256.
  • wavefront_size defaults to 6.
  • kernarg_segment_alignment, group_segment_alignment, and private_segment_alignment default to 4. Note that alignments are specified as a power of two, so a value of n means an alignment of 2^n.
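
Because the alignment keys hold an exponent rather than the alignment itself, it can help to see the conversion spelled out. The Python sketch below uses the hypothetical name decode_alignment and assumes the resulting alignment is measured in bytes, which this guide does not state explicitly.

```python
def decode_alignment(exponent: int) -> int:
    """Convert a power-of-two exponent n into the alignment 2^n.

    The unit of the result is assumed to be bytes.
    """
    if exponent < 0:
        raise ValueError("exponent must be non-negative")
    return 1 << exponent
```

With the default value of 4, decode_alignment(4) returns 16.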

The .amd_kernel_code_t directive must be placed immediately after the function label and before any instructions.

For a full list of amd_kernel_code_t keys, refer to the AMDGPU ABI document, the comments in lib/Target/AMDGPU/AmdKernelCodeT.h, and test/CodeGen/AMDGPU/hsa.s.

Here is an example of a minimal amd_kernel_code_t specification:

.hsa_code_object_version 1,0
.hsa_code_object_isa

.hsatext
.globl  hello_world
.p2align 8
.amdgpu_hsa_kernel hello_world

hello_world:

  .amd_kernel_code_t
    enable_sgpr_kernarg_segment_ptr = 1
    is_ptr64 = 1
    compute_pgm_rsrc1_vgprs = 0
    compute_pgm_rsrc1_sgprs = 0
    compute_pgm_rsrc2_user_sgpr = 2
    kernarg_segment_byte_size = 8
    wavefront_sgpr_count = 2
    workitem_vgpr_count = 3
  .end_amd_kernel_code_t

  s_load_dwordx2 s[0:1], s[0:1] 0x0   ; load the 64-bit kernel argument
  v_mov_b32 v0, 3.14159
  s_waitcnt lgkmcnt(0)                ; wait for the scalar load to finish
  v_mov_b32 v1, s0
  v_mov_b32 v2, s1
  flat_store_dword v[1:2], v0         ; store v0 to the loaded address
  s_endpgm
.Lfunc_end0:
  .size   hello_world, .Lfunc_end0-hello_world