This is an archive of the discontinued LLVM Phabricator instance.

[mlir] Use processing unit names for `thread_dim_map` and `mapped to dims`
Abandoned (Public)

Authored by guraypp on Oct 27 2022, 8:59 AM.

Details

Summary

Currently, `thread_dim_map` and the `mapped to dims` array are used to map to dims, which is very confusing in many ways. This change uses meaningful names in these structures.

For now there are `thread_x/y/z` and `block_x/y/z`. They cannot be mixed in the same `foreach_thread`, but mixing them or using them together may become possible in some cases in the future.

Other than the renaming, the change is almost NFC.

Diff Detail

Event Timeline

guraypp created this revision. Oct 27 2022, 8:59 AM
guraypp requested review of this revision. Oct 27 2022, 8:59 AM
guraypp edited the summary of this revision. Oct 27 2022, 9:00 AM
guraypp updated this revision to Diff 471225. Oct 27 2022, 10:42 AM
guraypp edited the summary of this revision.

update the description

guraypp updated this revision to Diff 471419.Oct 28 2022, 12:14 AM

remove comments

Have you considered making it an actual attribute in the GPU dialect, e.g., an enum attribute? Passing strings around feels a lot like unsafe JSON even if it is a readability improvement.

mlir/include/mlir/Dialect/Utils/StaticValueUtils.h
61

Nit: do not specify the number of stack elements in the vector unless you have a strong reason to.

mlir/lib/Dialect/GPU/TransformOps/GPUTransformOps.cpp
298–299

This seems orthogonal to the purpose of the change, can you please comment why this is necessary?

mlir/lib/Dialect/SCF/IR/SCF.cpp
1073–1081

Why does it have to be a bit map? Are we expecting to use a mapping that is simultaneously thread-x, thread-y and block-z, for example?

nicolasvasilache added a comment (edited). Oct 28 2022, 1:27 AM

Have you considered making it an actual attribute in the GPU dialect, e.g., an enum attribute? Passing strings around feels a lot like unsafe JSON even if it is a readability improvement.

One aspect here is that the op semantics and transformations should be retargetable.
Is there a way to have something like a BaseForeachThreadTargetEnumAttr for which the GPU dialect could define Thread/Block, the IREE xxx dialect could define Workgroupyy, and a future ACME dialect could define its own?

A (far from ideal) solution would be to have users define their own names "foo", "bar", "baz" and explicitly have the map_to_target specify the map ["foo" -> threadIdx.x ...].
This is so ugly that I don't think it should be considered, but it should give a feeling for what a preferred way to interact with the system could look like.

guraypp updated this revision to Diff 471436.Oct 28 2022, 1:28 AM

remove size of the vector

Have you considered making it an actual attribute in the GPU dialect, e.g., an enum attribute? Passing strings around feels a lot like unsafe JSON even if it is a readability improvement.

I totally agree with you. Actually, I created an enum array at the beginning. But `foreach_thread` is manually parsed, printed, and verified, so I had to parse the enum array manually as well, which also looked unsafe to me. Therefore, I chose the string array option.
But if you have a strong argument, I can work on changing it to an enum.

mlir/include/mlir/Dialect/Utils/StaticValueUtils.h
61

I removed the size here.

mlir/lib/Dialect/GPU/TransformOps/GPUTransformOps.cpp
298–299

Good catch :) This is a minor bug fix.

Without this line, blockDim always required 3 dimensions, which was annoying when foreach_thread contains fewer than 3 loops. This line sets unused dimensions to 1.
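
A minimal C++ sketch of the padding described above; the variable name blockDims is an assumption for illustration, not the patch's actual code.

#include "llvm/ADT/SmallVector.h"

// Hypothetical: foreach_thread maps only two loops, so only two sizes
// are provided by the user.
llvm::SmallVector<int64_t> blockDims = {32, 8};
// Pad the unused trailing dimensions to 1, yielding {32, 8, 1}.
blockDims.resize(3, /*Value=*/1);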

mlir/lib/Dialect/SCF/IR/SCF.cpp
1073–1081

Exactly. One can think about mixing blocks and threads as below. This will be my future work.

loop1 --> blockIdx.x, threadIdx.x
loop2 --> blockIdx.y, threadIdx.y

Is there a way to have something like a BaseForeachThreadTargetEnumAttr for which the GPU dialect could define Thread/Block, the IREE xxx dialect could define Workgroupyy, and a future ACME dialect could define its own?

If you don't constrain the kind of attributes in the op verifier, any downstream client can use whatever they want there. Lowerings can then check if they support the specific attribute kind and fail early if they don't. If you want more guarantees, introduce an attribute interface. There is no need to invent new infrastructure layers here.
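
A rough C++ sketch of that early-exit check inside a lowering pattern; the attribute class name MyAcceleratorMappingAttr is made up for illustration.

// Hypothetical: a lowering pattern checks whether it supports the
// specific mapping attribute kind and fails to match early otherwise.
if (!mappingAttr.isa<MyAcceleratorMappingAttr>())
  return rewriter.notifyMatchFailure(op, "unsupported mapping attribute kind");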

Cool, thanks for the guidance, I had not followed that part of the stack.

Then I have a proposal. As Alex said, let's relax the verifier of foreach_thread, for example as follows (here I improved the syntax a bit).

scf.foreach_thread (%bi, %bj)  {
  scf.foreach_thread (%ti, %tj) {
  }  {map = "unit", dimensions = ["x", "y"]}
}  {map = "group", dimensions = ["x", "y"]}

Next, let's add a new generic op to the GPU dialect. One can map a named parallelism unit to blocks and threads.

// transform dialect for GPU
transform.gpu.map_foreach %op "group" to blocks
transform.gpu.map_foreach %op "unit" to threads

Another accelerator's transform dialect might use only the outer level of parallelism, for example.

// transform dialect for other accelerator
transform.other_accelerator.map_foreach %op "group" to threads

I will also need to change tile_to_foreach_thread_op to something like the following.

transform.structured.tile_to_foreach_thread_op %op (map = "unit", dimensions = ["x", "y"])
ftynse added a comment. Nov 2 2022, 5:24 AM

I think having string attributes and then binding them to something in extra operations is overkill, and it also doesn't advance us in the right direction. If you watch some of the earlier MLIR presentations, we specifically wanted to avoid MLIR becoming a "JSON of compiler IRs". String attributes loosely bound by an operation located elsewhere have exactly that JSON feeling for me.

What I propose concretely is to introduce a DeviceMappingAttrInterface, initially with no methods at all, just as a unification mechanism. Then, have a GPUThreadMappingAttr and a GPUBlockMappingAttr, potentially sharing parts of the implementation, that implement the DeviceMappingAttrInterface. These can print as #gpu.threads<y, z> and #gpu.blocks<x, y>, respectively. Other accelerators can then introduce their own attributes, e.g. #other_accelerator.mapping<parallel_hw_feature_no_7>, that also implement the interface.

The verifier of foreach_thread can then check that the mapping attribute is a DeviceMappingAttrInterface (or an array thereof, depending on the desired model for mapping to more than one parallel HW dimension). Initially, the lowering pattern to GPU can do an additional isa<GPUThreadMappingAttr, GPUBlockMappingAttr>() on the attribute and fail to match otherwise, with a helpful debug message. Other accelerators can do the same. In the slightly longer term, we can consider moving HW-specific parts of the lowering into interface methods so we can keep a common lowering, but this is not a priority.
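
A minimal C++ sketch of the verifier side of this proposal. DeviceMappingAttrInterface, GPUThreadMappingAttr, GPUBlockMappingAttr, and the mapping array are the names suggested above, taken here as assumptions rather than an existing API at the time of this review.

// Sketch only: accept any attribute implementing the proposed interface;
// HW-specific narrowing happens in the lowering patterns.
LogicalResult verifyMapping(scf::ForeachThreadOp op, ArrayAttr mapping) {
  for (Attribute attr : mapping)
    if (!attr.isa<DeviceMappingAttrInterface>())
      return op.emitOpError("mapping entries must implement "
                            "DeviceMappingAttrInterface");
  return success();
}

// The GPU lowering additionally narrows to the attribute kinds it supports:
if (!attr.isa<GPUThreadMappingAttr, GPUBlockMappingAttr>())
  return rewriter.notifyMatchFailure(op, "unsupported mapping attribute");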

@ftynse thanks for the clear explanation. As it is different from this current PR, I implemented it in D137413. Let me know what you think.

nicolasvasilache resigned from this revision. Nov 10 2022, 8:35 AM

I think we can now abandon this?

guraypp abandoned this revision. Nov 10 2022, 8:41 AM

We implemented a better solution in https://reviews.llvm.org/D137413