For demonstration and discussion purposes only; not meant to be reviewed or submitted.
This is the code of the 'early prototype' presented in the GPU host-side dialect discussion at MLIR's design meetings (slides and recording:
https://drive.google.com/corp/drive/folders/1-93qa9Esu2m0_xoZrB_5x3CjkIUqL_DD)
Instead of a new 'async' op with a region, the prototype adds a variadic list of 'gpu.chain' inputs and one 'gpu.chain' output to individual async ops.
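To illustrate the idea only: the following is a rough Python model of the chaining scheme, not MLIR and not the prototype's code. `Chain`, `AsyncOp`, `async_op`, and `schedule` are hypothetical names made up for this sketch. It assumes that a chain is just an opaque token, and that an op may run once every chain it consumes has been produced.

```python
from dataclasses import dataclass, field
from itertools import count

_ids = count()  # source of unique chain ids (illustrative only)


@dataclass(frozen=True)
class Chain:
    """Hypothetical stand-in for a 'gpu.chain' value: an ordering token."""
    id: int


@dataclass
class AsyncOp:
    """Hypothetical async op: variadic chain inputs, exactly one chain output."""
    name: str
    deps: tuple
    out: Chain = field(default_factory=lambda: Chain(next(_ids)))


def async_op(name, *deps):
    """Create an async op that waits on the given chains."""
    return AsyncOp(name, tuple(deps))


def schedule(ops):
    """Return op names in an order that respects the chain dependencies."""
    done, order = set(), []
    pending = list(ops)
    while pending:
        # An op is ready once all of its input chains have been produced.
        ready = [op for op in pending if all(d in done for d in op.deps)]
        if not ready:
            raise ValueError("cyclic or dangling chain dependency")
        for op in ready:
            order.append(op.name)
            done.add(op.out)
        pending = [op for op in pending if op not in ready]
    return order


# Two independent copies, a launch that waits on both, then a copy back.
copy_a = async_op("memcpy_a")
copy_b = async_op("memcpy_b")
launch = async_op("launch", copy_a.out, copy_b.out)
copy_out = async_op("memcpy_out", launch.out)

print(schedule([copy_out, launch, copy_b, copy_a]))
```

In this model the dependencies are ordinary values passed between ops, so no enclosing region is needed to express which ops may overlap: the two copies share no chain and are unordered, while the launch is ordered after both.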