This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP] Multi architecture compilation support
Needs Review · Public

Authored by saiislam on Jul 27 2021, 6:00 AM.

Details

Summary

Multiple offloading targets can now be specified on the command
line. A toolchain instance is created for each unique
combination of target triple and target GPU. The device runtime
has been modified to support binaries containing multiple images,
each for a different target.
The data structure "__tgt_image_info", defined in
"llvm-project/openmp/libomptarget/include/omptarget.h", is used
to pass the requirements of each image. For example, a GPU name
such as gfx906 or sm35 is a requirement of the image. These
requirements are produced by clang-offload-wrapper and read by
the device RTL.

Example:

clang -O2 -target x86_64-pc-linux-gnu -fopenmp \
  -fopenmp-targets=amdgcn-amd-amdhsa,amdgcn-amd-amdhsa \
  -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx906 \
  -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908 \
  helloworld.c -o helloworld

Event Timeline

saiislam created this revision. Jul 27 2021, 6:00 AM
saiislam requested review of this revision. Jul 27 2021, 6:00 AM
Herald added projects: Restricted Project, Restricted Project. Jul 27 2021, 6:00 AM
ye-luo added a subscriber: ye-luo. Edited · Jul 27 2021, 6:41 AM

-fopenmp-targets=amdgcn-amd-amdhsa,amdgcn-amd-amdhsa seems burdensome. Could you just count how many -Xopenmp-target=amdgcn-amd-amdhsa there are on the command line and then count the unique ones?

To me, -fopenmp-targets=amdgcn-amd-amdhsa,nvptx64 makes sense.
-fopenmp-targets=amdgcn-amd-amdhsa,amdgcn-amd-amdhsa is not optimal.
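
The counting idea suggested above can be sketched as follows. This is only an illustration in Python, not clang driver code: `offload_targets` is a hypothetical helper, and it assumes every `-Xopenmp-target=<triple>` is immediately followed by a `-march=<arch>` argument, as in the example command from the summary.

```python
def offload_targets(argv):
    """Collect unique (triple, arch) pairs from -Xopenmp-target/-march pairs.

    Hypothetical helper for illustration; the real driver parses options
    differently.
    """
    pairs = []
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg.startswith("-Xopenmp-target="):
            triple = arg.split("=", 1)[1]
            # Assumption: the forwarded option is the next argument.
            if i + 1 < len(argv) and argv[i + 1].startswith("-march="):
                arch = argv[i + 1].split("=", 1)[1]
                if (triple, arch) not in pairs:
                    pairs.append((triple, arch))
                i += 1  # skip the -march argument we just consumed
        i += 1
    return pairs


example_argv = [
    "clang", "-O2", "-fopenmp",
    "-Xopenmp-target=amdgcn-amd-amdhsa", "-march=gfx906",
    "-Xopenmp-target=amdgcn-amd-amdhsa", "-march=gfx908",
    "-Xopenmp-target=amdgcn-amd-amdhsa", "-march=gfx906",  # duplicate
    "helloworld.c", "-o", "helloworld",
]
example_targets = offload_targets(example_argv)
```

With this approach the number of device compilations would follow from the unique pairs, so the triple would not need to be repeated in -fopenmp-targets.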

-fopenmp-targets=amdgcn-amd-amdhsa,amdgcn-amd-amdhsa seems burdensome. Could you just count how many -Xopenmp-target=amdgcn-amd-amdhsa there are on the command line and then count the unique ones?

I have a patch in the pipeline which will eliminate the need for -fopenmp-targets, -Xopenmp-target, and -march altogether. The user will be able to compile with just "--offload-arch=gfx906" instead of using the other three flags.
It is working in our downstream AOMP compiler, but I haven't posted a phab review yet.

saiislam added a subscriber: ronlieb.
saiislam added inline comments. Jul 27 2021, 6:59 AM
openmp/libomptarget/src/rtl.cpp
306

The call to the amdgpu-arch binary is going to be replaced with a call to a new library named OffloadArch. It will return the current GPU name along with the enabled GPU features (i.e. requirements) in a platform-independent way. As the library and its various functionalities are self-contained, I decided to post it as a separate review and use amdgpu-arch here for demonstration.
I will be posting the phab review for the library soon.

-fopenmp-targets=amdgcn-amd-amdhsa,amdgcn-amd-amdhsa seems burdensome. Could you just count how many -Xopenmp-target=amdgcn-amd-amdhsa there are on the command line and then count the unique ones?

I have a patch in the pipeline which will eliminate the need for -fopenmp-targets, -Xopenmp-target, and -march altogether. The user will be able to compile with just "--offload-arch=gfx906" instead of using the other three flags.
It is working in our downstream AOMP compiler, but I haven't posted a phab review yet.

That is just a convenience option and a separate topic. I'm commenting on the current generic option you are fiddling with.

JonChesterfield added a comment. Edited · Jul 27 2021, 8:01 AM

There seems to be a bunch of different things in this patch.

There's some driver plumbing to compile for more than one arch (presumably by calling the target compiler N times). That's a great feature; I want to build an application that can run on nvptx or amdgpu. We probably need a test case showing that combination.

Then there's a bunch of stuff to do with 'requirements', but it's not clear what that is.

Finally there's some stuff where libomptarget dlopens itself then spawns amdgpu-arch. I can't tell why we would want to do that.

My guess was that each arch would get its own section in the host executable containing a code object and each host plugin would be responsible for indicating whether it could do anything with a given code object. That should work out of the box for machines with only one offloading arch available, and need some work around device_id to handle multiple ones.
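
The dispatch guessed at above, where each host plugin indicates whether it can use a given code object, might look roughly like this. It is a sketch under stated assumptions: the `Plugin` class, `can_run`, and `assign_images` are hypothetical names, not the libomptarget plugin interface, and each image is reduced to a name plus the arch it was built for.

```python
class Plugin:
    """Hypothetical stand-in for a host plugin (not the libomptarget API)."""

    def __init__(self, name, supported_archs):
        self.name = name
        # Assumption: the plugin knows which archs its devices can execute.
        self.supported_archs = set(supported_archs)

    def can_run(self, image):
        """Report whether this plugin can do anything with the image."""
        return image["arch"] in self.supported_archs


def assign_images(images, plugins):
    """Map each image name to the first plugin that accepts it, else None."""
    assignment = {}
    for image in images:
        assignment[image["name"]] = next(
            (p.name for p in plugins if p.can_run(image)), None
        )
    return assignment


example_images = [
    {"name": "img0", "arch": "gfx906"},
    {"name": "img1", "arch": "sm_70"},
    {"name": "img2", "arch": "gfx1030"},  # no plugin below accepts this one
]
example_plugins = [
    Plugin("amdgpu", ["gfx906", "gfx908"]),
    Plugin("cuda", ["sm_70"]),
]
example_assignment = assign_images(example_images, example_plugins)
```

On a machine with a single offloading arch, only one image would be claimed and the rest ignored. Handling several usable images at once is where the device_id work mentioned above would come in.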

saiislam added inline comments. Jul 28 2021, 8:01 AM
openmp/libomptarget/src/rtl.cpp
306

Here is the patch for the OffloadArch library: D106960

I think this patch needs to be split up into a large number of much smaller pieces.

Spent some time reading through this. I think the idea is to create a host binary that contains code objects for multiple variants of amdgpu - e.g. one that runs on gfx906 and another on gfx908, or one that runs on gfx906-xnack+ and another on gfx906-xnack-.

That's close to the long-running feature request to compile a program to a binary that can run on totally different architectures, e.g. nvptx + amdgpu + vgpu + remote. Probably the first step is to make one binary that can run on whichever target is present, and then extend it to run on a system that has multiple targets available. I've got an nvptx / amdgpu box here that would be well suited to testing that. Tagging Ron and Pushpinder, who may be interested in such.
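
Selecting between variants such as gfx906-xnack+ and gfx906-xnack- could be sketched as below. The names `parse_target` and `image_matches_device` are hypothetical, the name format is taken from the examples in this comment, and the matching rule is an assumption of this sketch rather than what the patch implements: the processor must be equal, and every feature setting the image specifies must equal the device's setting, while an image that leaves a feature unspecified is accepted for either setting.

```python
import re

# Processor name followed by zero or more "-<feature><+|->" suffixes.
_TARGET_RE = re.compile(r"^([a-z0-9]+)((?:-[a-z]+[+-])*)$")


def parse_target(name):
    """Split e.g. 'gfx906-xnack+' into ('gfx906', {'xnack': '+'})."""
    match = _TARGET_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized target name: {name!r}")
    features = dict(re.findall(r"-([a-z]+)([+-])", match.group(2)))
    return match.group(1), features


def image_matches_device(image, device):
    """Assumed rule: same processor, and no conflicting feature setting."""
    image_proc, image_features = parse_target(image)
    device_proc, device_features = parse_target(device)
    if image_proc != device_proc:
        return False
    return all(
        device_features.get(feature) == setting
        for feature, setting in image_features.items()
    )
```

For instance, under this rule an image built for gfx906-xnack- would be rejected on a gfx906-xnack+ device, while a plain gfx906 image would be accepted on both.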

Can you document the device binary embedding scheme for multiple GPUs in the clang documentation? This will help tool developers build tools that extract device binaries from executables or shared libraries. It may also help interoperability with other offloading language modes, in case multiple offloading modes need to be supported in one executable or shared library in the future.