Currently, we only load an image if its supported architecture matches
all of the devices found. This prevents us from supporting a system with
multiple GPUs from the same vendor but different architectures. This
patch makes a very simple change that returns that the image is
incompatible only once we've searched every device.
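The difference between the two behaviors can be sketched in a few lines. This is a minimal illustration, not the libomptarget code: `Plugin`, `isCompatibleWithAll`, and `isCompatibleWithAny` are hypothetical names, and architectures are compared as plain strings, whereas the real plugins do richer target ID matching.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-in for a plugin: only the architectures of its devices.
struct Plugin {
  std::vector<std::string> DeviceArchs;
};

// Behavior described as "current": accept the image only if every device
// managed by the plugin matches the image's architecture.
bool isCompatibleWithAll(const Plugin &P, const std::string &ImageArch) {
  return !P.DeviceArchs.empty() &&
         std::all_of(P.DeviceArchs.begin(), P.DeviceArchs.end(),
                     [&](const std::string &Arch) { return Arch == ImageArch; });
}

// Behavior proposed by the patch: report the image as incompatible only
// after every device has been searched without finding a match.
bool isCompatibleWithAny(const Plugin &P, const std::string &ImageArch) {
  for (const std::string &Arch : P.DeviceArchs)
    if (Arch == ImageArch)
      return true;
  return false;
}
```

With a plugin managing one gfx908 and one gfx90a device, the first check rejects a gfx908 image while the second accepts it.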
Repository: rG LLVM Github Monorepo
Event Timeline
"Currently, we only load an image if its supported architecture matches all of the devices found."
I cannot understand this statement.
My machine has one gfx906 GPU and one sm_86 GPU. When I build with offload to both GPUs, neither image supports both GPUs, yet both images are loaded fine.
clang++ -fopenmp --offload-arch=gfx906,sm_80 main.cpp
LIBOMPTARGET_DEBUG=1 ./a.out
Libomptarget --> Init target library!
Libomptarget --> Loading RTLs...
Libomptarget --> Loading library 'libomptarget.rtl.ppc64.nextgen.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.ppc64.nextgen.so': libomptarget.rtl.ppc64.nextgen.so: cannot open shared object file: No such file or directory!
Libomptarget --> Falling back to original plugin...
Libomptarget --> Loading library 'libomptarget.rtl.ppc64.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.ppc64.so': libomptarget.rtl.ppc64.so: cannot open shared object file: No such file or directory!
Libomptarget --> Loading library 'libomptarget.rtl.x86_64.nextgen.so'...
Libomptarget --> Successfully loaded library 'libomptarget.rtl.x86_64.nextgen.so'!
Libomptarget --> Registering RTL libomptarget.rtl.x86_64.nextgen.so supporting 4 devices!
Libomptarget --> Loading library 'libomptarget.rtl.cuda.nextgen.so'...
Libomptarget --> Successfully loaded library 'libomptarget.rtl.cuda.nextgen.so'!
Libomptarget --> Registering RTL libomptarget.rtl.cuda.nextgen.so supporting 1 devices!
Libomptarget --> Loading library 'libomptarget.rtl.aarch64.nextgen.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.aarch64.nextgen.so': libomptarget.rtl.aarch64.nextgen.so: cannot open shared object file: No such file or directory!
Libomptarget --> Falling back to original plugin...
Libomptarget --> Loading library 'libomptarget.rtl.aarch64.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.aarch64.so': libomptarget.rtl.aarch64.so: cannot open shared object file: No such file or directory!
Libomptarget --> Loading library 'libomptarget.rtl.ve.nextgen.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.ve.nextgen.so': libomptarget.rtl.ve.nextgen.so: cannot open shared object file: No such file or directory!
Libomptarget --> Falling back to original plugin...
Libomptarget --> Loading library 'libomptarget.rtl.ve.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.ve.so': libomptarget.rtl.ve.so: cannot open shared object file: No such file or directory!
Libomptarget --> Loading library 'libomptarget.rtl.amdgpu.nextgen.so'...
Libomptarget --> Successfully loaded library 'libomptarget.rtl.amdgpu.nextgen.so'!
Libomptarget --> Registering RTL libomptarget.rtl.amdgpu.nextgen.so supporting 1 devices!
Libomptarget --> Loading library 'libomptarget.rtl.rpc.nextgen.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.rpc.nextgen.so': libomptarget.rtl.rpc.nextgen.so: cannot open shared object file: No such file or directory!
Libomptarget --> Falling back to original plugin...
Libomptarget --> Loading library 'libomptarget.rtl.rpc.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.rpc.so': libomptarget.rtl.rpc.so: cannot open shared object file: No such file or directory!
Libomptarget --> RTLs loaded!
Libomptarget --> OMPT: Enter ompt_init
OMPT --> OMPT: Trying to load library libomp.so
OMPT --> OMPT: Trying to get address of connection routine ompt_libomp_connect
OMPT --> OMPT: Library connection handle = 0x7fb8c650f5d0
Libomptarget --> OMPT: Exit ompt_init
Libomptarget --> Image 0x00005635224dc0d8 is NOT compatible with RTL !
PluginInterface --> Image is compatible with current environment: sm_80
Libomptarget --> Image 0x00005635224dc0d8 is compatible with RTL !
Libomptarget --> RTL 0x000056352392e0a0 has index 0!
Libomptarget --> Registering image 0x00005635224dc0d8 with RTL !
Libomptarget --> Image 0x00005635224fbb80 is NOT compatible with RTL !
Libomptarget --> Image 0x00005635224fbb80 is NOT compatible with RTL !
TARGET AMDGPU RTL --> Compatible: Target IDs are compatible [Image: gfx906] : [Env: gfx906:sramecc+:xnack-]
PluginInterface --> Image is compatible with current environment: gfx906
Libomptarget --> Image 0x00005635224fbb80 is compatible with RTL !
Libomptarget --> RTL 0x0000563523963d40 has index 1!
Libomptarget --> Registering image 0x00005635224fbb80 with RTL !
Libomptarget --> Done registering entries!
Libomptarget --> Entering target region for device -1 with entry point 0x00005635224dc004
Libomptarget --> Call to omp_get_num_devices returning 2
Libomptarget --> Default TARGET OFFLOAD policy is now mandatory (devices were found)
Libomptarget --> Use default device id 0
Libomptarget --> Call to omp_get_num_devices returning 2
Libomptarget --> Call to omp_get_num_devices returning 2
Libomptarget --> Call to omp_get_initial_device returning 2
Libomptarget --> Checking whether device 0 is ready.
Libomptarget --> Is the device 0 (local ID 0) initialized? 0
TARGET CUDA RTL --> The primary context is inactive, set its flags to CU_CTX_SCHED_BLOCKING_SYNC
Libomptarget --> Device 0 is ready to use.
PluginInterface --> Load data from image 0x00005635224dc0d8
PluginInterface --> Succesfully write 16 bytes associated with global symbol '__omp_rtl_device_environment' to the device (0x7fb78be00000 -> 0x7ffd028aec10).
TARGET CUDA RTL --> Entry point 0x00005635224ff048 maps to __omp_offloading_10307_18c6198_main_l4 (0x000056352458f130)
PluginInterface --> Global symbol '__omp_offloading_10307_18c6198_main_l4_exec_mode' was found in the ELF image and 1 bytes will copied from 0x5635224fa770 to 0x7ffd028aeaa0.
PluginInterface --> Entry point 0x0000000000000000 maps to __omp_offloading_10307_18c6198_main_l4 (0x0000563524558980)
Libomptarget --> loop trip count is 0.
Libomptarget --> Launching target execution __omp_offloading_10307_18c6198_main_l4 with pointer 0x0000563524558980 (index=0).
PluginInterface --> Launching kernel __omp_offloading_10307_18c6198_main_l4 with 1 blocks and 128 threads in Generic mode
Libomptarget --> Unloading target library!
Libomptarget --> Unregistered image 0x00005635224dc0d8 from RTL 0x000056352392e0a0!
Libomptarget --> Unregistered image 0x00005635224fbb80 from RTL 0x0000563523963d40!
Libomptarget --> Done unregistering images!
Libomptarget --> Removing translation table for descriptor 0x00005635224ff048
Libomptarget --> Done unregistering library!
Libomptarget --> Deinit target library!
I should have specified: "if its architecture matches all of the devices found for a single plugin". For example, we currently will not initialize two devices on a machine with both gfx90a and gfx908.
What we currently have conservatively determines that a plugin is compatible with the image only if all devices managed by the plugin are compatible. This is because when we initialize the device vector, we use a one-for-all approach: if a plugin is compatible, we assume we can use all devices managed by the plugin. This patch will break that assumption.
So if the system has a gfx908 and a gfx90a and we build with --offload-arch=gfx908,gfx90a, it will report both images as incompatible, because neither image supports both devices.
It sounds like this might break some invariants we have about plugins, devices, and images.
If this does not work, some alternative ideas:
What if we make the plugins detect that they have different architectures and they create one instance per arch?
Alternatively, we could force the plugin to choose: if device 0 works with an image and device 1 does not, we ignore device 1 from then on.
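The first alternative, one plugin instance per architecture, amounts to partitioning a plugin's devices by architecture so that the one-for-all assumption holds within each group. A rough sketch under that reading follows; `groupDevicesByArch` is a hypothetical helper, not an existing libomptarget function, and architectures are again plain strings.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: partition a plugin's devices by architecture.
// Each resulting group could back one plugin instance, so that every device
// in an instance is compatible with the same images.
// Input: DeviceArchs[I] is the architecture of physical device I.
// Output: architecture -> physical device indices, in discovery order.
std::map<std::string, std::vector<int>>
groupDevicesByArch(const std::vector<std::string> &DeviceArchs) {
  std::map<std::string, std::vector<int>> Groups;
  for (int I = 0; I < static_cast<int>(DeviceArchs.size()); ++I)
    Groups[DeviceArchs[I]].push_back(I);
  return Groups;
}
```

For a system with devices gfx908, gfx90a, gfx908, this yields two groups: gfx908 with devices 0 and 2, and gfx90a with device 1.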
Yes, the problem right now is that we assign device IDs based on the logical devices found, not on whether they loaded an image. To make this work, we would need to initialize the devices in the order in which we find compatible images for them. So on a system with gfx90a and gfx908 devices, if we had only a gfx908 image, we would find that it matches the gfx908 device and assign that device ID zero. This is definitely a break from the current logic, where we simply initialize all the devices found.
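The ID assignment described here could look roughly like the following. This is only a sketch of the idea under simplifying assumptions: `assignDeviceIds` is a hypothetical function, architectures are compared as exact strings, and -1 is used here to mark a device that gets no user-visible ID.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch: hand out user-visible device IDs in the order in which
// images are found to be compatible, instead of numbering every device found.
// Returns one entry per physical device: its assigned ID, or -1 if no image
// matched it (assumption: such devices are simply left uninitialized).
std::vector<int> assignDeviceIds(const std::vector<std::string> &DeviceArchs,
                                 const std::vector<std::string> &ImageArchs) {
  std::vector<int> Ids(DeviceArchs.size(), -1);
  int NextId = 0;
  for (const std::string &Image : ImageArchs)
    for (std::size_t D = 0; D < DeviceArchs.size(); ++D)
      if (Ids[D] == -1 && DeviceArchs[D] == Image)
        Ids[D] = NextId++;
  return Ids;
}
```

With devices gfx90a and gfx908 and a single gfx908 image, the gfx908 device receives ID 0 and the gfx90a device receives no ID.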