Commit 5bbcca6

Committed Mar 8, 2018
[AMDGPU] Update AMDGOUUsage.rst descriptions
- Improve description of XNACK ELF flag.
- Rename all uses of wave to wavefront to be consistent.

Differential Revision: https://reviews.llvm.org/D43983

llvm-svn: 326989
1 parent 003be7c commit 5bbcca6

1 file changed: +32 -27 lines changed
‎llvm/docs/AMDGPUUsage.rst

Lines changed: 32 additions & 27 deletions
@@ -503,6 +503,11 @@ The AMDGPU backend uses the following ELF header:
503503
target feature is
504504
enabled for all code
505505
contained in the code object.
506+
If the processor
507+
does not support the
508+
``xnack`` target
509+
feature then must
510+
be 0.
506511
See
507512
:ref:`amdgpu-target-features`.
508513
================================= ========== =============================
@@ -1455,7 +1460,7 @@ address to physical address is:
14551460
There are different ways that the wavefront scratch base address is determined
14561461
by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
14571462
memory can be accessed in an interleaved manner using buffer instruction with
1458-
the scratch buffer descriptor and per wave scratch offset, by the scratch
1463+
the scratch buffer descriptor and per wavefront scratch offset, by the scratch
14591464
instructions, or by flat instructions. If each lane of a wavefront accesses the
14601465
same private address, the interleaving results in adjacent dwords being accessed
14611466
and hence requires fewer cache lines to be fetched. Multi-dword access is not
@@ -1796,7 +1801,7 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
17961801
Bits Size Field Name Description
17971802
======= ======= =============================== ===========================================================================
17981803
0 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the
1799-
_WAVE_OFFSET SGPR wave scratch offset
1804+
_WAVEFRONT_OFFSET SGPR wavefront scratch offset
18001805
system register (see
18011806
:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
18021807

@@ -1883,7 +1888,7 @@ CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
18831888
exceptions exceptions
18841889
enabled which are generated
18851890
when a memory violation has
1886-
occurred for this wave from
1891+
occurred for this wavefront from
18871892
L1 or LDS
18881893
(write-to-read-only-memory,
18891894
mis-aligned atomic, LDS
@@ -2007,10 +2012,10 @@ SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
20072012
an SGPR number.
20082013

20092014
The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
2010-
all waves of the grid. It is possible to specify more than 16 User SGPRs using
2015+
all wavefronts of the grid. It is possible to specify more than 16 User SGPRs using
20112016
the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
20122017
initialized. These are then immediately followed by the System SGPRs that are
2013-
set up by ADC/SPI and can have different values for each wave of the grid
2018+
set up by ADC/SPI and can have different values for each wavefront of the grid
20142019
dispatch.
20152020

20162021
SGPR register initial state is defined in
@@ -2025,10 +2030,10 @@ SGPR register initial state is defined in
20252030
field) SGPRs
20262031
========== ========================== ====== ==============================
20272032
First Private Segment Buffer 4 V# that can be used, together
2028-
(enable_sgpr_private with Scratch Wave Offset as an
2029-
_segment_buffer) offset, to access the private
2030-
memory space using a segment
2031-
address.
2033+
(enable_sgpr_private with Scratch Wavefront Offset
2034+
_segment_buffer) as an offset, to access the
2035+
private memory space using a
2036+
segment address.
20322037

20332038
CP uses the value provided by
20342039
the runtime.
@@ -2068,7 +2073,7 @@ SGPR register initial state is defined in
20682073
address is
20692074
``SH_HIDDEN_PRIVATE_BASE_VIMID``
20702075
plus this offset.) The value
2071-
of Scratch Wave Offset must
2076+
of Scratch Wavefront Offset must
20722077
be added to this offset by
20732078
the kernel machine code,
20742079
right shifted by 8, and
@@ -2078,13 +2083,13 @@ SGPR register initial state is defined in
20782083
to SGPRn-4 on GFX7, and
20792084
SGPRn-6 on GFX8 (where SGPRn
20802085
is the highest numbered SGPR
2081-
allocated to the wave).
2086+
allocated to the wavefront).
20822087
FLAT_SCRATCH_HI is
20832088
multiplied by 256 (as it is
20842089
in units of 256 bytes) and
20852090
added to
20862091
``SH_HIDDEN_PRIVATE_BASE_VIMID``
2087-
to calculate the per wave
2092+
to calculate the per wavefront
20882093
FLAT SCRATCH BASE in flat
20892094
memory instructions that
20902095
access the scratch
@@ -2124,7 +2129,7 @@ SGPR register initial state is defined in
21242129
divides it if there are
21252130
multiple Shader Arrays each
21262131
with its own SPI). The value
2127-
of Scratch Wave Offset must
2132+
of Scratch Wavefront Offset must
21282133
be added by the kernel
21292134
machine code and the result
21302135
moved to the FLAT_SCRATCH
@@ -2193,12 +2198,12 @@ SGPR register initial state is defined in
21932198
then Work-Group Id Z 1 32 bit work-group id in Z
21942199
(enable_sgpr_workgroup_id dimension of grid for
21952200
_Z) wavefront.
2196-
then Work-Group Info 1 {first_wave, 14'b0000,
2201+
then Work-Group Info 1 {first_wavefront, 14'b0000,
21972202
(enable_sgpr_workgroup ordered_append_term[10:0],
2198-
_info) threadgroup_size_in_waves[5:0]}
2199-
then Scratch Wave Offset 1 32 bit byte offset from base
2203+
_info) threadgroup_size_in_wavefronts[5:0]}
2204+
then Scratch Wavefront Offset 1 32 bit byte offset from base
22002205
(enable_sgpr_private of scratch base of queue
2201-
_segment_wave_offset) executing the kernel
2206+
_segment_wavefront_offset) executing the kernel
22022207
dispatch. Must be used as an
22032208
offset with Private
22042209
segment address when using
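
As an illustration of the Work-Group Info layout shown in the hunk above, here is a minimal Python sketch that unpacks the 32-bit SGPR value into its fields. It assumes the braces follow Verilog-style concatenation, so the first listed field occupies the most significant bit; the source does not state this explicitly. The names `WorkGroupInfo` and `decode_work_group_info` are hypothetical and do not come from LLVM.

```python
from typing import NamedTuple


class WorkGroupInfo(NamedTuple):
    """Hypothetical container for the fields of the Work-Group Info SGPR."""
    first_wavefront: bool
    ordered_append_term: int
    threadgroup_size_in_wavefronts: int


def decode_work_group_info(sgpr: int) -> WorkGroupInfo:
    """Unpack {first_wavefront, 14'b0, ordered_append_term[10:0],
    threadgroup_size_in_wavefronts[5:0]}.

    Assumption: fields are packed MSB first, so the widths 1 + 14 + 11 + 6
    fill bits 31, 30:17, 16:6 and 5:0 respectively.
    """
    sgpr &= 0xFFFFFFFF  # treat the input as a 32-bit register value
    return WorkGroupInfo(
        first_wavefront=bool((sgpr >> 31) & 0x1),
        ordered_append_term=(sgpr >> 6) & 0x7FF,        # 11 bits
        threadgroup_size_in_wavefronts=sgpr & 0x3F,     # 6 bits
    )
```

For example, `decode_work_group_info((1 << 31) | (5 << 6) | 4)` yields a first wavefront with an ordered append term of 5 in a work-group of 4 wavefronts.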
@@ -2244,8 +2249,8 @@ The setting of registers is is done by GPU CP/ADC/SPI hardware as follows:
22442249
registers.
22452250
2. Work-group Id registers X, Y, Z are set by ADC which supports any
22462251
combination including none.
2247-
3. Scratch Wave Offset is set by SPI in a per wave basis which is why its value
2248-
cannot included with the flat scratch init value which is per queue.
2252+
3. Scratch Wavefront Offset is set by SPI in a per wavefront basis which is why
2253+
its value cannot included with the flat scratch init value which is per queue.
22492254
4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
22502255
or (X, Y, Z).
22512256

@@ -2293,7 +2298,7 @@ Flat Scratch
22932298

22942299
If the kernel may use flat operations to access scratch memory, the prolog code
22952300
must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
2296-
are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wave
2301+
are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wavefront
22972302
Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
22982303

22992304
GFX6
@@ -2304,7 +2309,7 @@ GFX7-GFX8
23042309
``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
23052310
being managed by SPI for the queue executing the kernel dispatch. This is
23062311
the same value used in the Scratch Segment Buffer V# base address. The
2307-
prolog must add the value of Scratch Wave Offset to get the wave's byte
2312+
prolog must add the value of Scratch Wavefront Offset to get the wavefront's byte
23082313
scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
23092314
FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
23102315
by 8 before moving into FLAT_SCRATCH_LO.
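
The GFX7-GFX8 arithmetic described in the hunk above (add the Scratch Wavefront Offset, then shift right by 8 to convert bytes to 256-byte units) can be sketched in Python as follows. This is only a model of the arithmetic, not the prolog code the backend emits; the function name is hypothetical, and the 32-bit wraparound is an assumption about SGPR width.

```python
def flat_scratch_offset_in_256_byte_units(flat_scratch_init_offset: int,
                                          scratch_wavefront_offset: int) -> int:
    """Model the GFX7-GFX8 prolog arithmetic for the scratch offset.

    Both inputs are byte offsets. The sum is the wavefront's byte offset
    from SH_HIDDEN_PRIVATE_BASE_VIMID; shifting right by 8 expresses it in
    units of 256 bytes.
    """
    # Assumption: the addition happens in a 32-bit SGPR, so wrap at 2**32.
    byte_offset = (flat_scratch_init_offset + scratch_wavefront_offset) & 0xFFFFFFFF
    # Any low 8 bits are dropped by the shift; the sketch does not check
    # whether the inputs are 256-byte aligned.
    return byte_offset >> 8
```

For example, a queue offset of `0x1000` bytes plus a wavefront offset of `0x300` bytes gives `0x1300` bytes, which is `0x13` in 256-byte units.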
@@ -2318,7 +2323,7 @@ GFX7-GFX8
23182323
GFX9
23192324
The Flat Scratch Init is the 64 bit address of the base of scratch backing
23202325
memory being managed by SPI for the queue executing the kernel dispatch. The
2321-
prolog must add the value of Scratch Wave Offset and moved to the FLAT_SCRATCH
2326+
prolog must add the value of Scratch Wavefront Offset and moved to the FLAT_SCRATCH
23222327
pair for use as the flat scratch base in flat memory instructions.
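
For GFX9 the hunk above describes a plain 64-bit addition with no unit conversion. A minimal Python sketch of that arithmetic follows. It assumes the low 32 bits of the result go to FLAT_SCRATCH_LO and the high 32 bits to FLAT_SCRATCH_HI, which the text does not spell out; the function name is hypothetical.

```python
from typing import Tuple


def gfx9_flat_scratch_pair(flat_scratch_init: int,
                           scratch_wavefront_offset: int) -> Tuple[int, int]:
    """Model the GFX9 prolog arithmetic for the flat scratch base.

    flat_scratch_init is the 64-bit base address of the queue's scratch
    backing memory; scratch_wavefront_offset is a 32-bit byte offset.
    Returns (lo, hi), the two 32-bit halves of the 64-bit sum.
    """
    base = (flat_scratch_init + scratch_wavefront_offset) & 0xFFFFFFFFFFFFFFFF
    # Assumption: LO holds bits 31:0 and HI holds bits 63:32.
    return base & 0xFFFFFFFF, base >> 32
```

The carry out of the low half must propagate into the high half: `gfx9_flat_scratch_pair(0x1FFFFFF00, 0x200)` returns `(0x100, 0x2)`.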
23232328

23242329
.. _amdgpu-amdhsa-memory-model:
@@ -2384,12 +2389,12 @@ For GFX6-GFX9:
23842389
global order and involve no caching. Completion is reported to a wavefront in
23852390
execution order.
23862391
* The LDS memory has multiple request queues shared by the SIMDs of a
2387-
CU. Therefore, the LDS operations performed by different waves of a work-group
2392+
CU. Therefore, the LDS operations performed by different wavefronts of a work-group
23882393
can be reordered relative to each other, which can result in reordering the
23892394
visibility of vector memory operations with respect to LDS operations of other
23902395
wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
23912396
ensure synchronization between LDS operations and vector memory operations
2392-
between waves of a work-group, but not between operations performed by the
2397+
between wavefronts of a work-group, but not between operations performed by the
23932398
same wavefront.
23942399
* The vector memory operations are performed as wavefront wide operations and
23952400
completion is reported to a wavefront in execution order. The exception is
@@ -2399,7 +2404,7 @@ For GFX6-GFX9:
23992404
* The vector memory operations access a single vector L1 cache shared by all
24002405
SIMDs a CU. Therefore, no special action is required for coherence between the
24012406
lanes of a single wavefront, or for coherence between wavefronts in the same
2402-
work-group. A ``buffer_wbinvl1_vol`` is required for coherence between waves
2407+
work-group. A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts
24032408
executing in different work-groups as they may be executing on different CUs.
24042409
* The scalar memory operations access a scalar L1 cache shared by all wavefronts
24052410
on a group of CUs. The scalar and vector L1 caches are not coherent. However,
@@ -2410,7 +2415,7 @@ For GFX6-GFX9:
24102415
* The L2 cache has independent channels to service disjoint ranges of virtual
24112416
addresses.
24122417
* Each CU has a separate request queue per channel. Therefore, the vector and
2413-
scalar memory operations performed by waves executing in different work-groups
2418+
scalar memory operations performed by wavefronts executing in different work-groups
24142419
(which may be executing on different CUs) of an agent can be reordered
24152420
relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
24162421
synchronization between vector memory operations of different CUs. It ensures a
@@ -2460,7 +2465,7 @@ case the AMDGPU backend ensures the memory location used to spill is never
24602465
accessed by vector memory operations at the same time. If scalar writes are used
24612466
then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
24622467
return since the locations may be used for vector memory instructions by a
2463-
future wave that uses the same scratch area, or a function call that creates a
2468+
future wavefront that uses the same scratch area, or a function call that creates a
24642469
frame at the same address, respectively. There is no need for a ``s_dcache_inv``
24652470
as all scalar writes are write-before-read in the same thread.
24662471
