This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Add CPUCoherentL2 feature
Abandoned · Public

Authored by jvesely on Oct 26 2017, 3:58 PM.

Details

Reviewers
arsenm
t-tye
Summary

This will be used to generate the SLC flag for atomics on dGPUs.

Diff Detail

Repository
rL LLVM

Event Timeline

jvesely created this revision. Oct 26 2017, 3:58 PM

Is it going to be used with the amdhsa OS? Where can I find the spec for this?

t-tye edited edge metadata. Oct 26 2017, 8:42 PM

Are we sure using SLC is the way to achieve this? IIRC SLC can be used for streaming, but it does not ensure L2 bypass. On an APU, MTYPE=CC specifies the memory policy that supports coherence.

arsenm edited edge metadata. Oct 27 2017, 1:47 AM

This seems like a property of an individual operation, not a subtarget.

jvesely added a comment.

> Is it going to be used with the amdhsa OS? Where can I find the spec for this?

I planned to use this to enable system-level atomics that work between the CPU and a dGPU. At the moment it only works on Kaveri/Carrizo.

> Are we sure using SLC is the way to achieve this? IIRC SLC can be used for streaming, but it does not ensure L2 bypass. On an APU, MTYPE=CC specifies the memory policy that supports coherence.

The CI and GCN3 ISA specs for SLC say: "System Level Coherent. When set, accesses are forced to miss in level 2 texture cache and are coherent with system memory."
Has this been changed?
I only found MTYPE references for the image and buffer rsrc; is there a way to set it for flat ops?
The ISA specs also don't mention what values are allowed in those 3 bits.

> This seems like a property of an individual operation, not a subtarget.

Why would that be? If I have a system-scope atomic op, it needs the SLC flag on dGPUs, while it's fine without it on CC APUs.

t-tye added a comment.

I do not think using SLC will achieve what you are looking for. The default memory policy for SLC is to enable STREAMING mode, which leaves cache lines in the L2 cache and so will not achieve coherence. What you need is for the L2 cache to be kept coherent with the memory fabric, which is what the MTYPE and IOMMUv2 can provide on APUs. The runtime can configure the hardware to do this. Buffer instructions use a V# that can specify the MTYPE, and there are configuration registers that can be set for each aperture with the MTYPE to use.

What runtime are you intending to use to load and execute the code produced?

How were you thinking of controlling when to enable this? You would not want to affect the code generated today, since adding SLC will make code execute less performantly.

jvesely added a comment.

For my specific use case (system calls) I use HCC, but I'd expect this to apply generally to system-scope atomics.
The bug has been reported here [0], and I wrote simple atomic tests [1].

Configuring "System Level Coherent" flag to not guarantee system level coherence is a bit counter intuitive decision.
I'm not sure I follow the performance or MTYPE argument. Only system scope atomic operations need this, I'd expect them to be slow, and they should be used sparingly.
Can you point me to where MTYPE values are set (and documented)? I only found rsrc descriptor setup for scratch memory in ROCR.

[0] https://github.com/RadeonOpenCompute/hcc/issues/410
[1] https://github.com/jvesely/hcc-atomic-test

t-tye requested changes to this revision. Oct 28 2017, 10:41 AM

The cache policy is determined by: the address_mode the GPU is configured in (GPUVM32, GPUVM64, HSA32 or HSA64); the memory path used (GPUVM, or ATC/IOMMU, which is only available on APUs); the MTYPE of the access (UC: uncached, NC: non-coherent, NC_NV: non-coherent non-volatile, CC: cache coherent, the last only supported in the HSA* address_modes on the ATC/IOMMU memory path); the instruction kind (vector/scalar load/store/atomic); and the GLC/SLC bits in the instruction.

For instructions that use a buffer descriptor, the memory path and MTYPE are specified in the descriptor. For other instructions (FLAT*, scalar without a V#) the MTYPE is determined by the aperture that the virtual address falls in. Depending on the address_mode there are up to 5 apertures (gpuvm, APE1, shared, private, default). Each aperture is configured with a base/limit virtual address, an MTYPE, and a memory path.
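
To make the aperture lookup concrete, here is a minimal illustrative model of the resolution described above. All the names are hypothetical; the real configuration lives in hardware registers programmed by the driver, not in code like this:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical model of the per-aperture configuration described above.
enum class MType { UC, NC, NC_NV, CC };
enum class MemPath { GPUVM, ATC };

struct Aperture {
  uint64_t Base, Limit; // virtual address range covered by this aperture
  MType Type;           // MTYPE applied to accesses falling in the range
  MemPath Path;         // memory path used for those accesses
};

// For a FLAT or V#-less scalar access, the MTYPE comes from whichever
// configured aperture (gpuvm, APE1, shared, private) contains the address,
// falling back to the default aperture otherwise.
MType resolveFlatMType(uint64_t VA, const std::vector<Aperture> &Apertures,
                       MType DefaultMType) {
  for (const Aperture &A : Apertures)
    if (VA >= A.Base && VA <= A.Limit)
      return A.Type;
  return DefaultMType;
}
```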

To get the L2 to behave in a coherent manner, either the UC or CC MTYPEs can be used to ensure fine-grain coherence (at instruction granularity), or the L2 can be written back/invalidated explicitly at dispatch boundaries. The L1 caches also need to be managed explicitly (which is done for the LLVM atomics as mentioned below).

The GLC and SLC bits jointly determine the cache hit policy (HIT, or MISS on the L1 only) and the cache retention policy (LRU; EVICT, L1 only; STREAM, L2 only) for the L1 and L2. For example, the AMDGCN backend sets SLC to implement LLVM's nontemporal attribute on non-atomic memory operations, causing the L2 STREAM policy to be used (note that STREAM is not the same as bypass).
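
As a rough illustration (this encodes only what the paragraph above states; the exact per-bit encodings are defined in the CI/GCN3 ISA manuals and differ per cache level), the policy space selected by the two bits might be modeled like this:

```cpp
// Illustrative-only model of the GLC/SLC cache policy space.
enum class HitPolicy { Hit, Miss };             // MISS applies to the L1 only
enum class RetainPolicy { LRU, Evict, Stream }; // EVICT: L1 only, STREAM: L2 only

struct CachePolicy {
  HitPolicy L1Hit;
  RetainPolicy L1Retain;
  RetainPolicy L2Retain;
};

// Hypothetical decode consistent with the description above: GLC mainly
// affects L1 hit/retention behavior, while SLC selects the L2 STREAM
// retention policy. Note that STREAM still hits lines already resident in
// the L2 -- it is not a bypass.
CachePolicy decodeCachePolicy(bool GLC, bool SLC) {
  return CachePolicy{
      GLC ? HitPolicy::Miss : HitPolicy::Hit,
      GLC ? RetainPolicy::Evict : RetainPolicy::LRU,
      SLC ? RetainPolicy::Stream : RetainPolicy::LRU,
  };
}
```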

Making some instructions bypass the L2 only gets you the C++ relaxed atomic semantics, as it does not ensure that the other memory operations are made visible as required by the C++ acquire/release/seq_cst memory orderings. The AMDGCN backend implements the LLVM memory model by setting the GLC bit, using the L1 cache invalidate, and inserting waitcnt instructions appropriately (see [0] for more information). It relies on the runtime/driver to manage the L2 cache, either by setting the address_mode and apertures so the appropriate MTYPE/memory path is used, or by explicit writeback/invalidate at dispatch boundaries.

The runtime/driver may choose to provide memory allocators that return virtual addresses falling in the different apertures it has configured with different MTYPEs or memory paths. This allows some allocations to be coherent and others not. There may be a trade-off between coherence and performance; for example, accesses that use an MTYPE that bypasses the L2 may be slower than those that use the L2.

There are some other details, but hopefully the above is helpful and explains why using SLC will not achieve the goal of making CPU and GPU memory coherent.

[0] https://llvm.org/docs/AMDGPUUsage.html#memory-model

This revision now requires changes to proceed. Oct 28 2017, 10:41 AM
jvesely added a comment (edited). Oct 28 2017, 8:08 PM

> Making some instructions bypass the L2 only gets you the C++ relaxed atomic semantics, as it does not ensure that the other memory operations are made visible as required by the C++ acquire/release/seq_cst memory orderings.

Which is exactly what I aim to achieve. The second step after this patch is to modify SIMemoryLegalizer and add the SLC bit to system-scope atomic ops. I don't care about other memory accesses, just that the value of the atomic variable is coherent between the CPU and GPU; a rough sketch of that follow-up is below.
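
Something along these lines, as a minimal sketch of a fragment inside the AMDGPU backend (usual includes omitted). needsSystemSLC() and the SIAtomicScope parameter are assumptions for illustration, not existing API:

```cpp
// Hypothetical follow-up in SIMemoryLegalizer: force the SLC bit on atomic
// operations with system sync scope when the subtarget reports the proposed
// CPUCoherentL2-style feature. getNamedOperand/OpName::slc follow the
// existing AMDGPU backend style.
static bool enableSystemScopeSLC(MachineInstr &MI, const SIInstrInfo &TII,
                                 const SISubtarget &ST, SIAtomicScope Scope) {
  if (Scope != SIAtomicScope::SYSTEM || !ST.needsSystemSLC())
    return false;

  MachineOperand *SLC = TII.getNamedOperand(MI, AMDGPU::OpName::slc);
  if (!SLC || SLC->getImm() != 0)
    return false; // no SLC operand, or already set

  SLC->setImm(1); // request system-coherent L2 behavior for this access
  return true;
}
```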
Would the GPU issue a cache-snooping PCIE transaction, or do I need to handle CPU caches manually?

Thank you, that was helpful.
If I understood correctly, since there are no ISA-level L2 maintenance ops, the only way to achieve any coherence between the CPU and a dGPU is to bypass the L2 on all memory accesses, which is expected of any PCIE device.

t-tye added a comment (edited). Oct 29 2017, 8:57 AM

> Which is exactly what I aim to achieve. The second step after this patch is to modify SIMemoryLegalizer and add the SLC bit to system-scope atomic ops. I don't care about other memory accesses, just that the value of the atomic variable is coherent between the CPU and GPU.

Setting the SLC bit will not achieve this. It will result in the STREAMING cache policy, which will hit on lines already in the L2; that is not the same as BYPASS.

In order to implement system coherency you also need to consider the other memory operations and how you will make them coherent. The C++ memory model requires that synchronizing through an atomic also makes all the non-atomic memory operations visible, and from the ISA there is no way to write back or invalidate the L2 cache as there is for the L1 cache.

You can use the nontemporal attribute to cause the current SIMemoryLegalizer to put the SLC bit on non-atomic memory operations. This allows memory operations to be used while minimizing pollution of the L1 and L2 caches.

> Would the GPU issue a cache-snooping PCIE transaction, or do I need to handle CPU caches manually?

The CC MTYPE does cause the GPU L2 to listen to snoop requests, so writes by the CPU will invalidate GPU L2 cache lines. It does not cause the GPU to issue snoops, so writes by the GPU will not invalidate CPU lines. CC is only available on APUs, not dGPUs.

> If I understood correctly, since there are no ISA-level L2 maintenance ops, the only way to achieve any coherence between the CPU and a dGPU is to bypass the L2 on all memory accesses, which is expected of any PCIE device.

On an APU the CC MTYPE can be used, which allows reads to still be cached in the L2 but causes writes to write through the L2. On a dGPU the UC MTYPE can be used for the memory allocations that want to be system coherent, resulting in bypassing the L2. Not all memory allocations have to be UC, so allocations that do not require system coherence can still use NC and get the benefits of the L2.

On GFX9 the MTYPE is now managed in the page table entries. This is more flexible, as it eliminates the need for the fixed aperture configuration.

jvesely added a comment.

> Setting the SLC bit will not achieve this. It will result in the STREAMING cache policy, which will hit on lines already in the L2; that is not the same as BYPASS.

Is there a reason/use case to set it to streaming? Would it be too disruptive to change it?

> In order to implement system coherency you also need to consider the other memory operations and how you will make them coherent. The C++ memory model requires that synchronizing through an atomic also makes all the non-atomic memory operations visible, and from the ISA there is no way to write back or invalidate the L2 cache as there is for the L1 cache.

I don't want or need to implement full system memory coherence. I only want/need atomic operations to work between the CPU and a dGPU. It's OK if it only works for monotonic atomic ops.

> You can use the nontemporal attribute to cause the current SIMemoryLegalizer to put the SLC bit on non-atomic memory operations. This allows memory operations to be used while minimizing pollution of the L1 and L2 caches.

The nontemporal attribute does not work with atomic ops.

> The CC MTYPE does cause the GPU L2 to listen to snoop requests, so writes by the CPU will invalidate GPU L2 cache lines. It does not cause the GPU to issue snoops, so writes by the GPU will not invalidate CPU lines. CC is only available on APUs, not dGPUs.

Is there an MTYPE setting that would result in a snooping PCIE transaction on an L2 miss?

t-tye added a comment. Nov 2 2017, 8:38 PM

> I don't want or need to implement full system memory coherence. I only want/need atomic operations to work between the CPU and a dGPU. It's OK if it only works for monotonic atomic ops.

You can use the ROCm runtime to allocate memory that will be fine-grain coherent. This memory is allocated as system pinned memory in the aperture with the UC MTYPE and so will bypass the L2 cache. This is the memory used to allocate the HSA user-mode queues and HSA signals.
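
For illustration, a minimal sketch of allocating such fine-grained memory through the ROCR (HSA) runtime follows. This reflects a plain reading of the hsa_ext_amd.h pool API; error handling and granting the GPU agent access (hsa_amd_agents_allow_access) are omitted for brevity:

```cpp
#include <hsa/hsa.h>
#include <hsa/hsa_ext_amd.h>
#include <cstdio>

// Remember the first CPU agent we see.
static hsa_status_t findCpuAgent(hsa_agent_t Agent, void *Data) {
  hsa_device_type_t Type;
  hsa_agent_get_info(Agent, HSA_AGENT_INFO_DEVICE, &Type);
  if (Type != HSA_DEVICE_TYPE_CPU)
    return HSA_STATUS_SUCCESS;
  *static_cast<hsa_agent_t *>(Data) = Agent;
  return HSA_STATUS_INFO_BREAK; // stop iterating
}

// Remember the first fine-grained global pool; fine-grained pools are the
// system-coherent (L2-bypassing) ones discussed above.
static hsa_status_t findFineGrainPool(hsa_amd_memory_pool_t Pool, void *Data) {
  uint32_t Flags = 0;
  hsa_amd_memory_pool_get_info(Pool, HSA_AMD_MEMORY_POOL_INFO_GLOBAL_FLAGS,
                               &Flags);
  if (!(Flags & HSA_AMD_MEMORY_POOL_GLOBAL_FLAG_FINE_GRAINED))
    return HSA_STATUS_SUCCESS;
  *static_cast<hsa_amd_memory_pool_t *>(Data) = Pool;
  return HSA_STATUS_INFO_BREAK;
}

int main() {
  hsa_init();

  hsa_agent_t Cpu{};
  hsa_iterate_agents(findCpuAgent, &Cpu);

  hsa_amd_memory_pool_t Pool{};
  hsa_amd_agent_iterate_memory_pools(Cpu, findFineGrainPool, &Pool);

  // One cache line of fine-grained memory, usable for CPU<->GPU atomics.
  void *Ptr = nullptr;
  hsa_status_t Err = hsa_amd_memory_pool_allocate(Pool, 64, 0, &Ptr);
  std::printf("alloc %s: %p\n", Err == HSA_STATUS_SUCCESS ? "ok" : "failed",
              Ptr);

  hsa_amd_memory_pool_free(Ptr);
  hsa_shut_down();
  return 0;
}
```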

> Is there an MTYPE setting that would result in a snooping PCIE transaction on an L2 miss?

My understanding is that a dGPU L2 miss will go over PCIE and access the memory fabric to get the value; this will snoop the CPUs to get the current authoritative value. A GPU L2 writeback (the result of an L2 cache writeback command) or bypass (a write with MTYPE=UC) will send the value over PCIE to the memory fabric, which will invalidate the CPUs. A CPU write will not send a snoop over PCIE to invalidate the GPU L2.

The APUs can support MTYPE=CC, which responds to CPU snoops.

jvesely abandoned this revision. Nov 27 2017, 9:17 AM

Thanks for all the info. I think I can work around this by modifying ROCT to allocate the region from UC memory on dGPUs.

Thanks again,
Jan