This is an archive of the discontinued LLVM Phabricator instance.

[mlir][spirv] Convert tensor.extract for very small tensors
ClosedPublic

Authored by antiagainst on Mar 5 2021, 8:23 AM.

Details

Summary

Normally tensors will be stored in buffers before converting to SPIR-V,
given that is how a large amount of data is typically sent to the GPU.
However, the SPIR-V conversion also supports handling tensors directly.
This is for cases where the tensor only contains a small number of
elements and it makes sense to directly inline them as a small data
array in the shader. To handle this, the conversion may internally
create new local variables. SPIR-V consumers in GPU drivers may or may
not optimize these away, so this has implications for register pressure.
Therefore, a threshold is used to control when the patterns should kick in.
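
As a rough illustration (the op spellings and types below are only schematic, not exact dialect syntax), the conversion turns an extract from a small constant tensor into a function-local array variable plus an access-chain load:

```mlir
// Before: extracting from a small constant tensor.
%cst  = constant dense<[1.0, 2.0, 3.0, 4.0]> : tensor<4xf32>
%elem = tensor.extract %cst[%i] : tensor<4xf32>

// After (schematically): the elements are inlined into a local variable
// in the Function storage class, then indexed and loaded.
%var  = spv.Variable : !spv.ptr<!spv.array<4 x f32, stride=4>, Function>
// ... materialize the constant elements and store them into %var ...
// ... convert the index-typed %i into an i32 value %idx ...
%ptr  = spv.AccessChain %var[%idx] : !spv.ptr<!spv.array<4 x f32, stride=4>, Function>, i32
%elem = spv.Load "Function" %ptr : f32
```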

Diff Detail

Event Timeline

antiagainst created this revision.Mar 5 2021, 8:23 AM
antiagainst requested review of this revision.Mar 5 2021, 8:23 AM

The code change looks fine to me, however I wonder if we really want to support directly translating tensors to SPIR-V in the long term. This doesn't look well aligned with the similar lowering to LLVM, and the same code generation could be done by lowering the tensor earlier. You are more familiar with this code than I am, so maybe I'm missing some reasons why it is better to handle it this way, but I would assume getting rid of all tensors before converting to SPIR-V would be a better long-term solution.
Also, in terms of performance, I'm not sure this is the most efficient way to handle this case. I think having the constants in a uniform buffer would be more efficient on most GPUs than creating a private memory array and indexing into it (unless the indices end up being constant after optimization and the values can be stored in registers). In general GPUs have optimizations for uniform buffers, while private arrays are duplicated for each work item.

Yup, certainly. This is not a generally applicable pattern for converting all tensors of all sizes. It's actually very restricted: it is for constant tensors with a very small number of elements (like < 8 elements or something) where we want to just inline the data array inside the shader. (Sort of like having a data table whose values you query using indices.) This is also what I'm trying to make very clear with the comments and the special byteCountThreshold control. Creating a dedicated buffer for such a small number of elements may not really be worth it.

Use private storage

(nit: your revision description has a typo "Suppot")

antiagainst retitled this revision from [spirv] Suppot vector.extract conversion to [mlir][spirv] Convert tensor.extract for very small tensors.Mar 5 2021, 10:57 AM

Ah yeah. :D Fixed, thanks!

Switch back to Function storage class for now

ThomasRaoux accepted this revision.Mar 5 2021, 11:04 AM

I suspect the driver will end up storing it in memory, as most GPUs cannot index into the register file. This would end up using even more memory, as each work item would allocate its own copy unless the driver optimizes it, and it would prevent using the potential constant cache. Some drivers may optimize it into a bunch of select ops, but I wonder if we want to rely on the driver for that.

I don't mean to block progress and I think the code is fine. It would be good in the future to figure out whether we really want to allow keeping tensors all the way to SPIR-V conversion.

This revision is now accepted and ready to land.Mar 5 2021, 11:04 AM

+1. True. In general I would do whatever we can at the SPIR-V level to reduce the reliance on the driver compiler. Actually we should at least be using Private storage here, given this is not specific to the work item. However, right now SPIR-V global variables with the Private storage class do not support initializers well. I need to fix that; planning to do it later. In the long run I think we can inject patterns to convert to vector first before SPIR-V to make it more structured. But we still need to convert to !spv.array, given there is no native vector support for sizes > 4.
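
As a rough sketch of that direction (illustrative syntax only, and it assumes the missing Private-storage initializer support mentioned above), the constant table would become a module-level array shared by all invocations instead of a per-function variable:

```mlir
// Hypothetical end state: a read-only lookup table at module scope.
// NOTE: needs Private-storage initializer support, which is not there yet.
// 8 elements, so !spv.array rather than a vector (vectors max out at 4).
spv.GlobalVariable @lookup_table : !spv.ptr<!spv.array<8 x f32, stride=4>, Private>

// Inside the entry point: take the address, index, and load.
%addr = spv.mlir.addressof @lookup_table : !spv.ptr<!spv.array<8 x f32, stride=4>, Private>
%ptr  = spv.AccessChain %addr[%idx] : !spv.ptr<!spv.array<8 x f32, stride=4>, Private>, i32
%val  = spv.Load "Private" %ptr : f32
```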

Sounds like a good plan to me. My suggestion would be to lower very small arrays to a bunch of cmp/select ops, and to put larger arrays in a uniform buffer. I think using private memory for constant data is almost always going to be suboptimal. Private is still per work item as far as I understand. I think it only makes sense to use it when we need to index with non-constant data, and in my experience it will have significant cost as it goes to memory.
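
To make that concrete (hypothetical output, op spellings approximate), a three-entry table lookup lowered to a cmp/select chain would touch no memory at all:

```mlir
// Lookup of element %idx in the table [1.0, 2.0, 3.0] via selects.
%c1  = spv.Constant 1 : i32
%c2  = spv.Constant 2 : i32
%e0  = spv.Constant 1.0 : f32
%e1  = spv.Constant 2.0 : f32
%e2  = spv.Constant 3.0 : f32
%is1 = spv.IEqual %idx, %c1 : i32
%is2 = spv.IEqual %idx, %c2 : i32
%t0  = spv.Select %is1, %e1, %e0 : i1, f32
%val = spv.Select %is2, %e2, %t0 : i1, f32
```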

Yup. Sorry I didn't complete the sentence; it ended up being confusing I think: "... not specific to the work item [in order to give the driver compiler more direct hints that this is a global static const array so that it can handle it better]." We should also attach the NonWritable decoration to it. NonWritable's semantics were explicitly relaxed to support this kind of constant data array lookup. (We have similar needs on the graphics side too. It has been a few years since this was addressed, so GPU drivers should be somewhat reliable at optimizing it. ;-P) It's just that NonWritable cannot be attached to Private/Function storage objects before SPIR-V 1.4. (Vulkan 1.1 natively only supports SPIR-V 1.3. There is an extension, VK_KHR_spirv_1_4, that brings SPIR-V 1.4 to Vulkan 1.1 though, and it is getting adoption on Android.) Looks like we can throw in two patterns, one with the NonWritable decoration (and higher priority) and the other without, and let the target environment filter which one is suitable to use. (But again, we may need to fix the issue mentioned here first. Quite a few missing things. ;-P) Generating cmp/select instruction chains is certainly another interesting direction and might be better!

Ha, good point. I think NonWritable should work; it would be equivalent to the inline constant buffer of DX and the global constants of OCL. I'm pretty sure the driver just converts it into a constant buffer though :) Anyway, this is definitely good enough for now.