Allow lowering of wmma ops with 64-bit indexes
This looks good to me - thanks.
mlir/lib/Conversion/GPUToNVVM/WmmaOpsToNvvm.cpp, lines 187–189:
I assume this explanation wasn't a valid one. Could you clarify?
mlir/test/Conversion/GPUToNVVM/wmma-ops-to-nvvm.mlir, lines 13–23:
The default has changed to 64-bit here: could you please capture this in the commit summary?

mlir/test/Integration/GPU/CUDA/TensorCore/wmma-matmul-f32.mlir, line 3:
This can be run with a 32-bit index bitwidth. Do we need this change?
mlir/test/Conversion/GPUToNVVM/wmma-ops-to-nvvm.mlir, lines 13–23:
I now realize that the default just works as opposed to failing: it's derived from the data layout and happens to be 64-bit by default.
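For context, the lit test presumably exercises both configurations along these lines (a sketch: the exact RUN lines and CHECK prefixes are assumptions, not copied from the test):

```mlir
// Default: the index bitwidth is derived from the data layout (64-bit here).
// RUN: mlir-opt --convert-gpu-to-nvvm --split-input-file %s | FileCheck %s

// Explicit 32-bit indexing via the pass option.
// RUN: mlir-opt --convert-gpu-to-nvvm='index-bitwidth=32' --split-input-file %s \
// RUN:   | FileCheck --check-prefix=CHECK32 %s
```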
Thanks for the review!
mlir/lib/Conversion/GPUToNVVM/WmmaOpsToNvvm.cpp, lines 187–189:
Yes, even though the intrinsic only accepts a 32-bit stride, those ops can be used with 64-bit indexes.
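To make the point concrete, a minimal sketch of such an op (names, shapes, and the element type are illustrative, not taken from the patch):

```mlir
// %i and %j have type `index`, which lowers to i64 under the default
// 64-bit index bitwidth. The conversion then truncates the stride to
// i32, since the NVVM wmma load intrinsic only takes a 32-bit stride.
%a = gpu.subgroup_mma_load_matrix %src[%i, %j] {leadDimension = 32 : index}
    : memref<32x32xf16> -> !gpu.mma_matrix<16x16xf16, "AOp">
```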
mlir/test/Integration/GPU/CUDA/TensorCore/wmma-matmul-f32.mlir, line 3:
Correct, it can run with a 32-bit index, but most tests use the default 64-bit mode. Overall I don't believe the 32-bit mode is as relevant for CUDA. Let me know if you want me to keep an integration test in 32-bit mode; note that I'm still testing both modes in the lit tests.
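For reference, keeping a 32-bit variant would presumably just mean threading the option through the kernel lowering in the RUN line, roughly like this (the rest of the pipeline is elided and assumed):

```mlir
// Force 32-bit indexing when lowering the GPU kernel to NVVM:
// RUN: mlir-opt %s ... \
// RUN:   -pass-pipeline='gpu.module(convert-gpu-to-nvvm{index-bitwidth=32})' ...
```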
mlir/test/Integration/GPU/CUDA/TensorCore/wmma-matmul-f32.mlir, line 3:
Regarding the 32-bit mode's relevance for CUDA, my understanding is exactly the opposite. 32-bit indexing is quite important and common: it's often sufficient and is expected to only perform better. I recall @herhut or others from his team reporting on the MLIR forum that they saw better performance in many cases when using 32-bit indexing. So I think it's completely fine to use 32-bit indexing for cases where it's guaranteed to be sufficient.
auto -> IntegerAttr here.