This is an archive of the discontinued LLVM Phabricator instance.

[Libomptarget][WIP] Introduce VGPU Plugin
AcceptedPublic

Authored by atmnpatel on Nov 6 2021, 7:41 PM.

Details

Summary

This patch introduces a virtual GPU (x86) plugin. This allows for the
emulation of the GPU environment on the host. This re-uses the same
execution model, compilation paths, runtimes as a physical GPU. The
number of threads, warps, and CTAs are set through the environment
variables VGPU_{NUM_THREADS,NUM_WARPS,WARPS_PER_CTA} respectively.

Known Bugs:

  • There is UB somewhere in the DeviceRTL that occasionally sets stride to zero, causing a FPE segfault.

Diff Detail

Event Timeline

atmnpatel created this revision.Nov 6 2021, 7:41 PM
atmnpatel requested review of this revision.Nov 6 2021, 7:41 PM
Herald added projects: Restricted Project, Restricted Project, Restricted Project. · View Herald TranscriptNov 6 2021, 7:41 PM
jdoerfert added inline comments.Nov 8 2021, 7:59 AM
clang/lib/CodeGen/CGOpenMPRuntimeVirtualGPU.cpp
54 ↗(On Diff #385318)

We should be able to get rid of this file (and the cuda/hip) version. Might be the right time now as a precommit.

llvm/include/llvm/ADT/Triple.h
169

Let's call it OpenMP_VGPU or something like that to make it clear.

llvm/lib/Transforms/IPO/OpenMPOpt.cpp
2177 ↗(On Diff #385318)

@tianshilei1992 This needs a test.

openmp/libomptarget/DeviceRTL/src/Kernel.cpp
107 ↗(On Diff #385318)

I don't think we should do this. Instead, the plugin should signal as threads finish the kernel.

openmp/libomptarget/DeviceRTL/src/Mapping.cpp
238

We probably should use kind(CPU) or something instead. Nothing x86 specific about it I think.

openmp/libomptarget/include/DeviceEnvironment.h
83 ↗(On Diff #385318)

This should go into a new file (ThreadEnvironment)

I removed the shared var opt - might be best to keep this in a separate patch @tianshilei1992. Also addressed comments.

small nit fix

tianshilei1992 added inline comments.Nov 11 2021, 9:04 AM
clang/lib/Driver/ToolChains/Gnu.cpp
3077

Maybe "x86_64-openmp_vpu" now?

llvm/lib/Support/Triple.cpp
189

"openmp_vpu"?

I can't see it in the diff - does the cmake somewhere enable the existing tests on this new target?

A bit surprised to see ffi involved, are we thinking of spawning a separate process for the target?

clang/lib/Basic/Targets/X86.h
49

It's not clear to me what this is x86 specific. Being able to run our tests on power / arm etc seems like an advantage. Would also mean we would avoid adding openmp stuff the x86 specific files. Maybe OpenMPVGPUAddrSpaceMap and put it in one of the openmp source files?

clang/lib/Frontend/CompilerInvocation.cpp
3990

Add a isOpenmpVGPU function?

openmp/libomptarget/DeviceRTL/CMakeLists.txt
140

Should only add this include to the vgu, not all the plugins. May be able to use relative include paths to drop it entirely

atmnpatel updated this revision to Diff 398370.Jan 8 2022, 1:13 PM
  • Fixed lifetime issue around ffi_call
  • Addressed comments

The existing x86 plugin uses ffi, so this does as well, no explicit benefit in doing so. Is it worth keeping?

jdoerfert added inline comments.Jan 10 2022, 7:00 AM
llvm/lib/Support/Triple.cpp
512
openmp/libomptarget/DeviceRTL/src/Debug.cpp
53
openmp/libomptarget/DeviceRTL/src/Kernel.cpp
128 ↗(On Diff #398370)
openmp/libomptarget/DeviceRTL/src/Mapping.cpp
29
openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
291
315

Pass the memory order, also rename the arguments to match the coding convention.

318

Pass the mask

openmp/libomptarget/DeviceRTL/src/Utils.cpp
54
66

Can't we merge this with AMDGPU?

118
openmp/libomptarget/plugins/vgpu/src/rtl.cpp
304

Can we split this up and create some helper functions maybe?

openmp/libomptarget/src/rtl.cpp
27

Introduce an environment variable, if it is set, X86 target should skip the image.
Also, add a TODO such that we later look into the image and inspect it to decide automatically.

openmp/libomptarget/test/lit.cfg
189 ↗(On Diff #398370)

Leftovers.

tianshilei1992 added inline comments.Jan 12 2022, 6:55 AM
openmp/libomptarget/DeviceRTL/CMakeLists.txt
236

It's not a good practice to specify include directories in CMake in this way. Use include_directories instead.

openmp/libomptarget/DeviceRTL/src/Kernel.cpp
127 ↗(On Diff #398370)

Are these code here unintentional? We don't need to specialize this function for vgpu IIRC.

ormris removed a subscriber: ormris.Jan 18 2022, 10:04 AM
jdoerfert added inline comments.Jan 18 2022, 12:50 PM
openmp/libomptarget/DeviceRTL/src/Kernel.cpp
127 ↗(On Diff #398370)

we might be able to avoid it if we move the synchronize::threads "effect" into the VGPU instead.

atmnpatel edited the summary of this revision. (Show Details)Jan 18 2022, 11:32 PM
atmnpatel marked 9 inline comments as done.

Addressed comments

atmnpatel added inline comments.Jan 18 2022, 11:36 PM
openmp/libomptarget/DeviceRTL/CMakeLists.txt
236

can't quite do that here I think, afaik both include_directories and target_include_directories require that CMake builds the target, but we specify custom targets/build commands so they don't get pulled in

jdoerfert added inline comments.Feb 2 2022, 9:45 AM
clang/lib/Driver/ToolChains/Gnu.cpp
3077

not x86, right? triple contains the proper arch

openmp/libomptarget/DeviceRTL/src/Mapping.cpp
30

Move up to the beginning.

openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
292

Move up.

343

We should simply use omp locks. Either here, or maybe better, in VGPUImpl. So redirect all calls to there and use a proper lock. no OMP_SPIN and stuff

openmp/libomptarget/DeviceRTL/src/Utils.cpp
119

Move up

128

Pass the mask, both times.

openmp/libomptarget/plugins/vgpu/src/ThreadEnvironment.cpp
50

see above.

openmp/libomptarget/src/rtl.cpp
112

Not only x86, also let's not do strcmp. Extend RTLNAmes to be an array of structs with more elaborate information, e.g., is host flag. That said, unsure if not loading the plugin is the right way to not grab the image. Good enough for now.

openmp/libomptarget/test/CMakeLists.txt
23

This is to disable the tests? Not sure this is a good way though. For one, can we check against -vgpu not x86, also openmp-vgpu or something, right?

atmnpatel updated this revision to Diff 405407.Feb 2 2022, 1:19 PM
atmnpatel marked 7 inline comments as done.

updates

openmp/libomptarget/test/CMakeLists.txt
23

Yep

jdoerfert accepted this revision.Feb 3 2022, 9:26 AM

LG, with some things to address before the merge though.

Didn't we have a pass to expand shared memory (and such)?

clang/lib/Basic/TargetInfo.cpp
155

use isOpenMPVGPU

clang/lib/Basic/Targets/X86.h
395

Do we need the changes in this file at all? I couldn't see why.

clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
1125

isOpenMPVGPU

clang/lib/CodeGen/CodeGenModule.cpp
252

isOpenMPVGPU

clang/lib/Driver/ToolChains/Gnu.cpp
3076

isOpenMPVGPU

openmp/libomptarget/DeviceRTL/src/Synchronization.cpp
323

Remove these. Also the TODO below (copied from somewhere)

openmp/libomptarget/plugins/vgpu/src/ThreadEnvironmentImpl.cpp
85

This is racy, I think. Can we use atomic_add for all these Idx updates or pass the Id from the outside?

openmp/libomptarget/plugins/vgpu/src/ThreadEnvironmentImpl.h
118

at least add more information what the problem and potential solutions are.

openmp/libomptarget/plugins/vgpu/src/rtl.cpp
271

Move the lambda into a helper function. indention of 12 is too much.

313

When do we have more threads than NumThreads?

554

if we need for submit/retrieve, I'd assume to wait here too.

This revision is now accepted and ready to land.Feb 3 2022, 9:26 AM

Not sure if it's good to merge such a large patch. We could potentially split the patch to three independent patches: tool chain, device runtime, and the OpenMPOpt pass to support expansion of shared variable (which for some reason is not included in this patch. That is actually very important component otherwise the backend will complain about it).

We can merge runtime first, build it in isolation, then libomptarget host runtime, then clang.

Also make sure to adjust the commit messages

@jdoerfert @tianshilei1992 @atmnpatel @dhruvachak

Is the target to get this merged in for LLVM 16? Does the VGPU implementation provide a way to support OMPT callbacks for various constructs (parallel, worksharing, barriers, etc.)?

Herald added a project: Restricted Project. · View Herald TranscriptNov 8 2022, 10:13 AM
Herald added a subscriber: MaskRay. · View Herald Transcript