This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Add CPUCoherentL2 feature
Abandoned · Public

Authored by jvesely on Oct 26 2017, 3:58 PM.

Details

Reviewers
arsenm
t-tye
Summary

This will be used to generate the SLC flag for atomics on dGPUs.

Diff Detail

Repository
rL LLVM

Event Timeline

jvesely created this revision. Oct 26 2017, 3:58 PM

Is it going to be used with the amdhsa OS? Where can I find the spec for this?

t-tye edited edge metadata. Oct 26 2017, 8:42 PM

Are we sure using SLC is the way to achieve this? IIRC SLC can be used for streaming, but it does not ensure L2 bypass. On an APU, MTYPE=CC specifies the memory policy that supports coherence.

arsenm edited edge metadata. Oct 27 2017, 1:47 AM

This seems like a property of an individual operation, not a subtarget.

jvesely added a comment.

> Is it going to be used with the amdhsa OS? Where can I find the spec for this?

I planned to use this to enable system-level atomics that work between the CPU and a dGPU. At the moment it only works on Kaveri/Carrizo.

> Are we sure using SLC is the way to achieve this? IIRC SLC can be used for streaming, but it does not ensure L2 bypass. On an APU, MTYPE=CC specifies the memory policy that supports coherence.

The CI and GCN3 ISA specs for SLC say: "System Level Coherent. When set, accesses are forced to miss in level 2 texture cache and are coherent with system memory."
Has this been changed?
I only found MTYPE references for the image and buffer rsrc; is there a way to set it for flat ops?
The ISA specs also don't mention what values are allowed in those 3 bits.

> This seems like a property of an individual operation, not a subtarget.

Why would that be? If I have a system-scope atomic op, it needs the SLC flag on dGPUs, while it's fine without it on CC APUs.

t-tye added a comment.

I do not think using SLC will achieve what you are looking for. The default memory policy for SLC is to enable STREAMING mode, which leaves cache lines in the L2 cache and so will not achieve coherence. What you need is for the L2 cache to be kept coherent with the memory fabric, which is what the MTYPE and IOMMUv2 can provide on APUs. The runtime can configure the hardware to do this. Buffer instructions use a V# that can specify the MTYPE, and there are configuration registers that can be set for each aperture with the MTYPE to use.

What runtime are you intending to use to load and execute the code produced?

How were you thinking of controlling when to enable this? You would not want to affect the code generated today, since adding SLC will make code execute less performantly.

jvesely added a comment.

For my specific use case (system calls) I use HCC, but I'd expect this to apply generally to system-scope atomics.
The bug has been reported here [0], and I wrote simple atomic tests [1].

Configuring "System Level Coherent" flag to not guarantee system level coherence is a bit counter intuitive decision.
I'm not sure I follow the performance or MTYPE argument. Only system scope atomic operations need this, I'd expect them to be slow, and they should be used sparingly.
Can you point me to where MTYPE values are set (and documented)? I only found rsrc descriptor setup for scratch memory in ROCR.

[0] https://github.com/RadeonOpenCompute/hcc/issues/410
[1] https://github.com/jvesely/hcc-atomic-test

t-tye requested changes to this revision. Oct 28 2017, 10:41 AM

The cache policy is determined by: the address_mode the GPU is configured in (GPUVM32, GPUVM64, HSA32 or HSA64); the memory path used (GPUVM, or ATC/IOMMU, which is only available on APUs); the MTYPE of the access (UC: uncached, NC: non-coherent, NC_NV: non-coherent non-volatile, CC: cache coherent, the last only supported in the HSA* address_modes on the ATC/IOMMU memory path); the instruction kind (vector/scalar load/store/atomic); and the GLC/SLC bits in the instruction.

For instructions that use a buffer descriptor, the memory path and MTYPE are specified in the descriptor. For other instructions (FLAT*, scalar without a V#) the MTYPE is determined by the aperture that the virtual address falls in. Depending on the address_mode there are up to 5 apertures (gpuvm, APE1, shared, private, default). Each aperture is configured with a base/limit virtual address, an MTYPE, and a memory path.
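
To make the aperture lookup concrete, here is a minimal illustrative model of the resolution described above. All the names are hypothetical; the real configuration lives in hardware registers programmed by the driver, not in code like this:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical model of the per-aperture configuration described above.
enum class MType { UC, NC, NC_NV, CC };
enum class MemPath { GPUVM, ATC };

struct Aperture {
  uint64_t Base, Limit; // virtual address range covered by this aperture
  MType Type;           // MTYPE applied to accesses falling in the range
  MemPath Path;         // memory path used for those accesses
};

// For a FLAT or V#-less scalar access, the MTYPE comes from whichever
// configured aperture (gpuvm, APE1, shared, private) contains the address,
// falling back to the default aperture otherwise.
MType resolveFlatMType(uint64_t VA, const std::vector<Aperture> &Apertures,
                       MType DefaultMType) {
  for (const Aperture &A : Apertures)
    if (VA >= A.Base && VA <= A.Limit)
      return A.Type;
  return DefaultMType;
}
```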

To get the L2 to behave in a coherent manner, either the UC or CC MTYPEs can be used to ensure fine-grain coherence (at instruction granularity), or the L2 can be written back/invalidated explicitly at dispatch boundaries. The L1 caches also need to be managed explicitly (which is done for the LLVM atomics as mentioned below).

The GLC and SLC bits jointly determine the cache hit policy (HIT, or MISS on the L1 only) and the cache retention policy (LRU; EVICT, L1 only; STREAM, L2 only) for the L1 and L2. For example, the AMDGCN backend sets SLC to implement LLVM's nontemporal attribute on non-atomic memory operations, causing the L2 STREAM policy to be used (note that STREAM is not the same as bypass).
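
As a rough illustration (this encodes only what the paragraph above states; the exact per-bit encodings are defined in the CI/GCN3 ISA manuals and differ per cache level), the policy space selected by the two bits might be modeled like this:

```cpp
// Illustrative-only model of the GLC/SLC cache policy space.
enum class HitPolicy { Hit, Miss };             // MISS applies to the L1 only
enum class RetainPolicy { LRU, Evict, Stream }; // EVICT: L1 only, STREAM: L2 only

struct CachePolicy {
  HitPolicy L1Hit;
  RetainPolicy L1Retain;
  RetainPolicy L2Retain;
};

// Hypothetical decode consistent with the description above: GLC mainly
// affects L1 hit/retention behavior, while SLC selects the L2 STREAM
// retention policy. Note that STREAM still hits lines already resident in
// the L2 -- it is not a bypass.
CachePolicy decodeCachePolicy(bool GLC, bool SLC) {
  return CachePolicy{
      GLC ? HitPolicy::Miss : HitPolicy::Hit,
      GLC ? RetainPolicy::Evict : RetainPolicy::LRU,
      SLC ? RetainPolicy::Stream : RetainPolicy::LRU,
  };
}
```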

Making some instructions bypass the L2 only gets you the C++ relaxed atomic semantics, as it does not ensure that the other memory operations are made visible as required by the C++ acquire/release/seq_cst memory orderings. The AMDGCN backend implements the LLVM memory model by setting the GLC bit, using the L1 cache invalidate, and inserting waitcnt instructions appropriately (see [0] for more information). It relies on the runtime/driver to manage the L2 cache, either by setting the address_mode and apertures so the appropriate MTYPE/memory path is used, or by explicit writeback/invalidate at dispatch boundaries.

The runtime/driver may choose to provide memory allocators that return virtual addresses falling in the different apertures it has configured with different MTYPEs or memory paths. This allows some allocations to be coherent and others not. There may be a trade-off between coherence and performance; for example, accesses that use an MTYPE that bypasses the L2 may be slower than those that use the L2.

There are some other details, but hopefully the above is helpful and explains why using SLC will not achieve the goal of making CPU and GPU memory coherent.

[0] https://llvm.org/docs/AMDGPUUsage.html#memory-model

This revision now requires changes to proceed. Oct 28 2017, 10:41 AM
jvesely added a comment (edited). Oct 28 2017, 8:08 PM

> Making some instructions bypass the L2 only gets you the C++ relaxed atomic semantics, as it does not ensure that the other memory operations are made visible as required by the C++ acquire/release/seq_cst memory orderings.

Which is exactly what I aim to achieve. The second step after this patch is to modify SIMemoryLegalizer and add the SLC bit to system-scope atomic ops. I don't care about other memory accesses, just that the value of the atomic variable is coherent between the CPU and GPU; a rough sketch of that follow-up is below.
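
Something along these lines, as a minimal sketch of a fragment inside the AMDGPU backend (usual includes omitted). needsSystemSLC() and the SIAtomicScope parameter are assumptions for illustration, not existing API:

```cpp
// Hypothetical follow-up in SIMemoryLegalizer: force the SLC bit on atomic
// operations with system sync scope when the subtarget reports the proposed
// CPUCoherentL2-style feature. getNamedOperand/OpName::slc follow the
// existing AMDGPU backend style.
static bool enableSystemScopeSLC(MachineInstr &MI, const SIInstrInfo &TII,
                                 const SISubtarget &ST, SIAtomicScope Scope) {
  if (Scope != SIAtomicScope::SYSTEM || !ST.needsSystemSLC())
    return false;

  MachineOperand *SLC = TII.getNamedOperand(MI, AMDGPU::OpName::slc);
  if (!SLC || SLC->getImm() != 0)
    return false; // no SLC operand, or already set

  SLC->setImm(1); // request system-coherent L2 behavior for this access
  return true;
}
```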
Would the GPU issue a cache-snooping PCIE transaction, or do I need to handle CPU caches manually?

Thank you, that was helpful.
If I understood correctly, since there are no ISA-level L2 maintenance ops, the only way to achieve any coherence between the CPU and a dGPU is to bypass the L2 on all memory accesses, which is expected of any PCIE device.

t-tye added a comment (edited). Oct 29 2017, 8:57 AM

> Which is exactly what I aim to achieve. The second step after this patch is to modify SIMemoryLegalizer and add the SLC bit to system-scope atomic ops. I don't care about other memory accesses, just that the value of the atomic variable is coherent between the CPU and GPU.

Setting the SLC bit will not achieve this. It will result in the STREAMING cache policy, which will hit on lines already in the L2; that is not the same as BYPASS.

In order to implement system coherency you also need to consider the other memory operations and how you will make them coherent. The C++ memory model requires that synchronizing through an atomic also makes all the non-atomic memory operations visible, and from the ISA there is no way to write back or invalidate the L2 cache as there is for the L1 cache.

You can use the nontemporal attribute to cause the current SIMemoryLegalizer to put the SLC bit on non-atomic memory operations. This allows memory operations to be used while minimizing pollution of the L1 and L2 caches.

> Would the GPU issue a cache-snooping PCIE transaction, or do I need to handle CPU caches manually?

The CC MTYPE does cause the GPU L2 to listen to snoop requests, so writes by the CPU will invalidate GPU L2 cache lines. It does not cause the GPU to issue snoops, so writes by the GPU will not invalidate CPU lines. CC is only available on APUs, not dGPUs.

> If I understood correctly, since there are no ISA-level L2 maintenance ops, the only way to achieve any coherence between the CPU and a dGPU is to bypass the L2 on all memory accesses, which is expected of any PCIE device.

On an APU the CC MTYPE can be used, which allows reads to still be cached in the L2 but causes writes to write through the L2. On a dGPU the UC MTYPE can be used for the memory allocations that want to be system coherent, resulting in bypassing the L2. Not all memory allocations have to be UC, so allocations that do not require system coherence can still use NC and get the benefits of the L2.

On GFX9 the MTYPE is now managed in the page table entries. This is more flexible, as it eliminates the need for the fixed aperture configuration.

jvesely added a comment.

> Setting the SLC bit will not achieve this. It will result in the STREAMING cache policy, which will hit on lines already in the L2; that is not the same as BYPASS.

Is there a reason/use case to set it to streaming? Would it be too disruptive to change it?

> In order to implement system coherency you also need to consider the other memory operations and how you will make them coherent. The C++ memory model requires that synchronizing through an atomic also makes all the non-atomic memory operations visible, and from the ISA there is no way to write back or invalidate the L2 cache as there is for the L1 cache.

I don't want or need to implement full system memory coherence. I only want/need atomic operations to work between the CPU and a dGPU. It's OK if it only works for monotonic atomic ops.

> You can use the nontemporal attribute to cause the current SIMemoryLegalizer to put the SLC bit on non-atomic memory operations. This allows memory operations to be used while minimizing pollution of the L1 and L2 caches.

The nontemporal attribute does not work with atomic ops.

> The CC MTYPE does cause the GPU L2 to listen to snoop requests, so writes by the CPU will invalidate GPU L2 cache lines. It does not cause the GPU to issue snoops, so writes by the GPU will not invalidate CPU lines. CC is only available on APUs, not dGPUs.

Is there an MTYPE setting that would result in a snooping PCIE transaction on an L2 miss?

t-tye added a comment. Nov 2 2017, 8:38 PM

> I don't want or need to implement full system memory coherence. I only want/need atomic operations to work between the CPU and a dGPU. It's OK if it only works for monotonic atomic ops.

You can use the ROCm runtime to allocate memory that will be fine-grain coherent. This memory is allocated as system pinned memory in the aperture with the UC MTYPE and so will bypass the L2 cache. This is the memory used to allocate the HSA user-mode queues and HSA signals.
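
For illustration, a minimal sketch of allocating such fine-grained memory through the ROCR (HSA) runtime follows. This reflects a plain reading of the hsa_ext_amd.h pool API; error handling and granting the GPU agent access (hsa_amd_agents_allow_access) are omitted for brevity:

```cpp
#include <hsa/hsa.h>
#include <hsa/hsa_ext_amd.h>
#include <cstdio>

// Remember the first CPU agent we see.
static hsa_status_t findCpuAgent(hsa_agent_t Agent, void *Data) {
  hsa_device_type_t Type;
  hsa_agent_get_info(Agent, HSA_AGENT_INFO_DEVICE, &Type);
  if (Type != HSA_DEVICE_TYPE_CPU)
    return HSA_STATUS_SUCCESS;
  *static_cast<hsa_agent_t *>(Data) = Agent;
  return HSA_STATUS_INFO_BREAK; // stop iterating
}

// Remember the first fine-grained global pool; fine-grained pools are the
// system-coherent (L2-bypassing) ones discussed above.
static hsa_status_t findFineGrainPool(hsa_amd_memory_pool_t Pool, void *Data) {
  uint32_t Flags = 0;
  hsa_amd_memory_pool_get_info(Pool, HSA_AMD_MEMORY_POOL_INFO_GLOBAL_FLAGS,
                               &Flags);
  if (!(Flags & HSA_AMD_MEMORY_POOL_GLOBAL_FLAG_FINE_GRAINED))
    return HSA_STATUS_SUCCESS;
  *static_cast<hsa_amd_memory_pool_t *>(Data) = Pool;
  return HSA_STATUS_INFO_BREAK;
}

int main() {
  hsa_init();

  hsa_agent_t Cpu{};
  hsa_iterate_agents(findCpuAgent, &Cpu);

  hsa_amd_memory_pool_t Pool{};
  hsa_amd_agent_iterate_memory_pools(Cpu, findFineGrainPool, &Pool);

  // One cache line of fine-grained memory, usable for CPU<->GPU atomics.
  void *Ptr = nullptr;
  hsa_status_t Err = hsa_amd_memory_pool_allocate(Pool, 64, 0, &Ptr);
  std::printf("alloc %s: %p\n", Err == HSA_STATUS_SUCCESS ? "ok" : "failed",
              Ptr);

  hsa_amd_memory_pool_free(Ptr);
  hsa_shut_down();
  return 0;
}
```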

> Is there an MTYPE setting that would result in a snooping PCIE transaction on an L2 miss?

My understanding is that a dGPU L2 miss will go over PCIE and access the memory fabric to get the value; this will snoop the CPUs to get the current authoritative value. A GPU L2 writeback (the result of an L2 cache writeback command) or bypass (a write with MTYPE=UC) will send the value over PCIE to the memory fabric, which will invalidate the CPUs. A CPU write will not send a snoop over PCIE to invalidate the GPU L2.

The APUs can support MTYPE=CC, which responds to CPU snoops.

jvesely abandoned this revision. Nov 27 2017, 9:17 AM

Thanks for all the info. I think I can work around this by modifying ROCT to allocate the region from UC memory on dGPUs.

Thanks again,
Jan