This is an archive of the discontinued LLVM Phabricator instance.

[mlir][LinAlg][Transform][GPU] Add GPU memory hierarchy to the transform.promote op
ClosedPublic

Authored by tavakkoliamirmohammad on Feb 23 2023, 12:11 PM.

Details

Summary

In this patch we add support for copying a memref.subview to shared or private memory on GPUs. The global-to-shared memory copy is adapted from code in IREE (https://github.com/iree-org/iree), but the private memory copy part has not been implemented in IREE. This patch enables transferring a subview from global->shared, global->private, and shared->private (see the sketch after this summary).

Our final aim is to provide a copy layout as an affine map to the transform.promote op to support transposed memory copies. This map is a permutation of the original affine index map. Although this has been implemented and the user can copy data to an arbitrary layout, that attempt is not included in this patch, since we still have problems getting linalg.generic operations to change their index map to the transformed index map. You can find more in the following links (Initial attempt to support layout map in promote op in transform dialect) (Fix data transpose in shared memory)
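For context, here is a minimal C++ sketch of how such GPU copy callbacks plug into Linalg promotion. The helper names follow this patch's description; the exact callback signatures are assumed and may differ.

// Sketch only: wires the workgroup-memory helpers this patch introduces
// into Linalg promotion options.
#include "mlir/Dialect/Linalg/Transforms/Transforms.h"

using namespace mlir;

static linalg::LinalgPromotionOptions makeWorkgroupPromotionOptions() {
  linalg::LinalgPromotionOptions options;
  // Allocate/deallocate the promoted buffer in workgroup (shared) memory.
  options.setAllocationDeallocationFns(linalg::allocateWorkgroupMemory,
                                       linalg::deallocateWorkgroupMemory);
  // Copy the subview into (and back out of) the promoted buffer.
  options.setCopyInOutFns(linalg::copyToWorkgroupMemory,
                          linalg::copyToWorkgroupMemory);
  return options;
}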

Diff Detail

Event Timeline

tavakkoliamirmohammad requested review of this revision.Feb 23 2023, 12:11 PM

Formatted the file and fixed the wrong diff, which did not include the private memory copy.

ftynse requested changes to this revision.Feb 23 2023, 2:31 PM
ftynse added inline comments.
mlir/lib/Dialect/Linalg/TransformOps/LinalgTransformOps.cpp
1784

Functions that are only necessary in this translation unit should be declared static.

Top-level entities such as functions must be documented.

I would also consider placing that somewhere in Linalg/Transform/ rather than in TransformOps.
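A hedged illustration of both conventions (the function name and exact signature here are hypothetical):

/// Allocates a buffer in GPU shared memory for the given subview.
/// Documented with /// and declared static because it is only used in
/// this translation unit.
static std::optional<Value>
allocateSharedMemoryBuffer(OpBuilder &builder, memref::SubViewOp subview,
                           ArrayRef<Value> sizeBounds, DataLayout &layout);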

1796

Nit: don't specify the number of stack elements for the vector.

1804

It's good practice to use OpBuilder::InsertionGuard to restore the insertion point before the function returns.
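For instance (a generic sketch, not code from the patch):

// OpBuilder::InsertionGuard is an RAII helper: it saves the insertion
// point on construction and restores it when the guard is destroyed.
static void buildInBlock(OpBuilder &builder, Block *block) {
  OpBuilder::InsertionGuard guard(builder);
  builder.setInsertionPointToStart(block);
  // ... create ops at the start of `block` ...
} // The previous insertion point is restored here.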

1821

We don't commit commented-out code.

1847

This function only differs from the one above by not using the shared address space here. This is something that can be generalized.

We also have an explicit address space for private memory on GPUs. Likely, it should be used here instead of the default address space.
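A sketch of the suggested generalization, assuming the GPU dialect's gpu::AddressSpaceAttr (the helper name is hypothetical):

// One allocation helper parameterized over the GPU address space instead
// of two near-identical copies for shared and private memory.
// Needs: mlir/Dialect/GPU/IR/GPUDialect.h, mlir/Dialect/MemRef/IR/MemRef.h
static Value allocateGPUBuffer(OpBuilder &builder, Location loc,
                               MemRefType type, gpu::AddressSpace space) {
  Attribute memorySpace =
      gpu::AddressSpaceAttr::get(builder.getContext(), space);
  auto allocType = MemRefType::get(type.getShape(), type.getElementType(),
                                   type.getLayout(), memorySpace);
  return builder.create<memref::AllocOp>(loc, allocType);
}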

1880–1894

This would do strange things if both attributes are set. Have you considered having a single attribute that specifies which kind of memory to use?

This revision now requires changes to proceed.Feb 23 2023, 2:31 PM
mlir/lib/Dialect/Linalg/TransformOps/LinalgTransformOps.cpp
1804

This is done in line 1788. Is anything else needed to reset the insertion point? When the OpBuilder::InsertionGuard is destroyed, the insertion point is restored.

tavakkoliamirmohammad marked 6 inline comments as done.

Added a new device mapping enum for the GPU memory hierarchy, removed the commented-out code, and refactored the allocation, deallocation, and copy functions into Linalg/Transforms/Promotion.cpp.

ftynse requested changes to this revision.Feb 24 2023, 11:54 AM

Nice, almost there, thank you!

mlir/include/mlir/Dialect/GPU/TransformOps/GPUDeviceMappingAttr.td
99

Nit: we tend to prefer OpenCL-style terminology, so s/shared/workgroup (which is the actual name of the attribute). s/thread/workitem, s/block/workgroup.

mlir/include/mlir/Dialect/Linalg/Transforms/Transforms.h
396

Nit: top-level comments in MLIR should be prefixed with ///. Also, terminate sentences in documentation with a period.

mlir/lib/Dialect/Linalg/TransformOps/LinalgTransformOps.cpp
1804

We don't see this line in the review. Please reupload the patch with more context (git diff -U99999) or use Arcanist (https://llvm.org/docs/Phabricator.html).

1808

Nit: MLIR uses camelCase for variable names.

1808–1812

Nit: any chance of adding a temporary to make this more readable?

... mapping = getMapping()->getValue()
auto address_space = mapping.take_front()...

I also don't understand why this needs access to the raw .data().

1813–1815

It is less expensive to query the details of the attribute than to construct a new one; the latter needs to take a lock. Something like address_space.getAddressSpace() == gpu::GPUDialect::getWorkgroupAddressSpace() is preferable here.
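Putting the two nits together, the check might look like this (a fragment inside the op's implementation; names assumed from the surrounding review):

// Bind the mapping to a named temporary, then query the existing
// attribute's address space instead of constructing a new attribute.
ArrayRef<Attribute> mapping = getMapping()->getValue();
auto memorySpace = cast<gpu::GPUMemorySpaceMappingAttr>(mapping.front());
if (memorySpace.getAddressSpace() ==
    gpu::GPUDialect::getWorkgroupAddressSpace()) {
  // Promote into workgroup (shared) memory.
}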

mlir/lib/Dialect/Linalg/Transforms/Promotion.cpp
406

Ultra-nit: extra whitespace between the function and its documentation.

This revision now requires changes to proceed.Feb 24 2023, 11:54 AM
tavakkoliamirmohammad marked 7 inline comments as done.
This comment was removed by tavakkoliamirmohammad.
tavakkoliamirmohammad planned changes to this revision.Feb 24 2023, 9:47 PM
tavakkoliamirmohammad added inline comments.
mlir/lib/Dialect/Linalg/TransformOps/LinalgTransformOps.cpp
1808–1812

I wanted to get the first element of the mapping, so I had to go through the chain from mlir::ArrayAttr to llvm::ArrayRef to the actual data. I thought this would be more readable, but it seems the new approach is more readable.

tavakkoliamirmohammad updated this revision to Diff 500358. Edited Feb 24 2023, 9:51 PM

Added /// comments. Replaced the expensive attribute construction with a simple query. Changed the description of GPUMemorySpaceMappingAttr to reflect OpenCL terminology. Also uploaded the patch with the -U99999 flag.

ftynse accepted this revision.Feb 25 2023, 6:41 AM
This revision is now accepted and ready to land.Feb 25 2023, 6:41 AM

Thank you! I don't have commit access to the LLVM project to land this patch.

hw-yuexy added a subscriber: hw-yuexy. Edited Mar 22 2023, 2:36 PM

Hi Amir,

I have a question regarding the promote op patch.

When I chain two promote ops together, both promoting operands of a tiled linalg.matmul (the first copies from global memory to L1, the second from L1 to L0), I found that all memref.copy ops are placed in the innermost loop, with copy sizes dependent on the innermost loop IV. However, this can lead to bad performance: when moving from GM to L1, it is often more profitable to move larger chunks of data (i.e., we could hoist the GM->L1 memref.copy ops into an outer loop and copy more data). I'm wondering if there are plans to make the promote op more flexible, so that it can hoist some memref.copy ops to outer loops when possible?

Thanks.

Hi.
Sorry, I didn't get the notification for this message. The reason is mostly that workgroup or private memory cannot hold the entire data due to its low capacity. The usual technique is to tile and, in a loop, copy only the data needed into workgroup or private memory. If you know your entire data fits in those memories, you can promote the data before tiling for computation.

Hi Amir,

Thanks for the nice patch! I'm trying to use this patch as-is, with as few changes as possible, for my accelerator, which unfortunately has a few different types of 'private' memories.
The 'mapping' attribute is defined in the td file as a DeviceMappingArrayAttr, which requires the address spaces (#gpu.memory_space<workgroup> or #gpu.memory_space<private>) to be defined inside the td file (GPUDeviceMappingAttr.td) and to inherit from GPUMemorySpaceMappingAttr. This means I would need to modify MLIR (e.g., potentially add my hardware dialect) with my private memory spaces in order to use the promote op's mapping. I know it is not recommended to expose the address space as an IntegerAttr or UnitAttr. However, it would be more flexible, to be honest, since the memspace in a memref is an integer...

Any advice would be greatly appreciated! Thank you!

Amy