Add interfaces for subtargets to return the number of write-combining buffers available. Also provide TTI interfaces that delegate to the subtarget interface.
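To illustrate the delegation described in the summary, here is a minimal standalone sketch. It does not use the real LLVM classes: `SubtargetSketch`, `ExampleCPUSubtarget`, and `TTISketch` are hypothetical stand-ins, the value 4 is purely illustrative, and treating 0 as "unknown" is an assumption of the sketch rather than something the patch states.

```cpp
// Hypothetical stand-in for a subtarget; NOT the real LLVM subtarget class.
class SubtargetSketch {
public:
  virtual ~SubtargetSketch() = default;

  // Number of write-combining buffers. Returning 0 to mean "unknown"
  // is an assumption made for this sketch.
  virtual unsigned getNumWriteCombiningBuffers() const { return 0; }
};

// Hypothetical target that overrides the default with an illustrative value.
class ExampleCPUSubtarget : public SubtargetSketch {
public:
  unsigned getNumWriteCombiningBuffers() const override { return 4; }
};

// Hypothetical stand-in for the TTI layer: it holds a reference to the
// subtarget and forwards the query without adding logic of its own.
class TTISketch {
  const SubtargetSketch &ST;

public:
  explicit TTISketch(const SubtargetSketch &Subtarget) : ST(Subtarget) {}

  unsigned getNumWriteCombiningBuffers() const {
    return ST.getNumWriteCombiningBuffers();
  }
};
```

With this shape, a transformation pass only talks to the TTI-style wrapper and never needs to know which subtarget supplied the number.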
Repository: rG LLVM Github Monorepo
Event Timeline
How do you imagine that we'd use this? Do we need some kind of size to go along with this?
See the Intel optimization guide, section 3.6.9.
Basically, this information can be used to inform loop transformations as well as use of non-temporal instructions. A write-combining buffer is not the same as a store buffer. A write-combining buffer is always one cache line in size, so I don't think we need size information.
Alright, thanks. First, we should document this in the interface. Instead of just saying:
\return the number of write-combining buffers.
we might say something like:
\return the number of write-combining buffers. A write-combining buffer is a per-core resource used for collecting writes to a particular cache line before further processing those writes using other parts of the memory subsystem.
We already have getCacheLineSize(), so we know how big that is, but we don't currently have a way to account for the number of hardware threads per core, right? Don't we need that to estimate how many write-combining buffers we get for the current hardware thread? (Presumably, we'd want the same thing to use the total-cache-size functions too, because we need to generate code assuming a working-set size per thread?)
The Intel optimization guide talks about using this number to drive loop distribution, where we don't update more arrays (cache lines) at a time than can fit into the thread's WC buffers. Is this what you had in mind?
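As a rough illustration of the loop-distribution idea, the sketch below splits the store streams of a loop (one per array being written) into groups no larger than the number of write-combining buffers, with each group standing for one distributed loop. The function name `planLoopDistribution` is hypothetical, a buffer count of 0 is assumed to mean "unknown", and the sketch ignores data dependences and loads, which a real transformation would have to respect.

```cpp
#include <cstddef>
#include <vector>

// Partition store-stream indices 0..NumStreams-1 into groups of at most
// NumWCBuffers streams. Each group corresponds to one loop after distribution.
// Hypothetical helper: assumes streams are independent and freely reorderable.
std::vector<std::vector<std::size_t>>
planLoopDistribution(std::size_t NumStreams, unsigned NumWCBuffers) {
  std::vector<std::vector<std::size_t>> Groups;

  // Unknown buffer count, or everything already fits: keep a single loop.
  if (NumWCBuffers == 0 || NumStreams <= NumWCBuffers) {
    Groups.emplace_back();
    for (std::size_t I = 0; I < NumStreams; ++I)
      Groups.back().push_back(I);
    return Groups;
  }

  // Otherwise start a new group every NumWCBuffers streams.
  for (std::size_t I = 0; I < NumStreams; ++I) {
    if (I % NumWCBuffers == 0)
      Groups.emplace_back();
    Groups.back().push_back(I);
  }
  return Groups;
}
```

For example, a loop that writes six arrays on a target reporting four buffers would be planned as two loops, one writing four arrays and one writing two.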
> we might say something like:
> \return the number of write-combining buffers. A write-combining buffer is a per-core resource used for collecting writes to a particular cache line before further processing those writes using other parts of the memory subsystem.
Will do.
> we already have getCacheLineSize(), so we know how big that is, but we don't currently have a way to account for how many hardware threads per core, right? Don't we need that to estimate how many write-combining buffers we get for the current hardware thread? (Presumably, we'd want the same thing to use the total-cache-size functions too, because we need to generate code assuming a working-set size per thread?)
This will become available once more of the system model is implemented. The model can specify things like the number of cores and the number of threads per core. The subtarget will be able to examine its execution resource configuration and return an appropriate number. Once this patch lands, I can start posting the TableGen changes that generate models, followed by the TableGen model classes. At that point targets can define their own models and away we go.
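A minimal sketch of the kind of computation a subtarget might perform once thread counts are available in the model. The struct `CoreModelSketch` and the function `estimateWCBuffersPerThread` are hypothetical names, and the even split of buffers across hardware threads is an assumption; real hardware may share or partition the buffers differently.

```cpp
#include <algorithm>

// Hypothetical slice of a per-core model; not an LLVM or TableGen class.
struct CoreModelSketch {
  unsigned WCBuffersPerCore; // 0 is assumed to mean "unknown"
  unsigned ThreadsPerCore;   // 0 is treated as a single thread
};

// Estimate the write-combining buffers available to one hardware thread,
// assuming the per-core buffers are divided evenly among the threads.
unsigned estimateWCBuffersPerThread(const CoreModelSketch &Model) {
  if (Model.WCBuffersPerCore == 0)
    return 0; // Unknown stays unknown.

  unsigned Threads = std::max(1u, Model.ThreadsPerCore);

  // Never report fewer than one buffer when the core is known to have some.
  return std::max(1u, Model.WCBuffersPerCore / Threads);
}
```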
> The Intel optimization guide talks about using this number to drive loop distribution, where we don't update more arrays (cache lines) at a time than can fit into the thread's WC buffers. Is this what you had in mind?
Yes. It's useful for anything that cares about performance of writes to memory.
Added comment explaining what a write-combining buffer is and one possibility of how to use this information.
Ok, I think I've addressed all the comments so this is ready for another review. Thanks!
@jdoerfert @hfinkel should I consider Johannes' comment an LGTM? This has been waiting for two months now.