
[libomptarget][devicertl] Port amdgcn devicertl to openmp
Changes Planned · Public

Authored by JonChesterfield on Dec 11 2020, 11:25 AM.



[libomptarget][devicertl] Port amdgcn devicertl to openmp

Uses macros to abstract over the cuda and openmp syntax for shared variables.
Cuda ignores unknown #pragmas, so that macro plus declare target is mostly
sufficient. Nvptx still compiles as cuda for now.

This change is localised to the runtime. It can be improved later by changing
the compiler, e.g. to implicitly wrap a CU with the pragma target, or to bring
the intrinsics into scope implicitly.

Diff Detail

Event Timeline

JonChesterfield requested review of this revision. Dec 11 2020, 11:25 AM
Herald added a project: Restricted Project. Dec 11 2020, 11:25 AM

Trunk clang doesn't accept these flags just yet, but trunk clang also refuses to compile this code as -x hip so this isn't much of a regression.


This is the only array variable, so it is cleaner to write the pragma allocate here than to introduce another macro

Passes internal test suites. I can add the nvptx cmake file to this change instead of as a follow-on. I will need help checking it works on nvidia hardware either way.

JonChesterfield added a comment. Edited Jan 4 2021, 2:26 AM

Just re-read this and still agree with the direction, but have seen a couple of things to double check. I don't mind the SHARED foo -> SHARED(foo) syntax change or pragma noise. The 'todo' comments about what look like compiler bugs are a concern, but they won't be the only compiler bugs that need to be fixed for amdgpu, and this runtime may be the most complicated openmp code I've tried to compile.

I have checked that this is a non-functional change for nvptx (by compiling the deviceRTL to bitcode with and without it and diffing the IR, which is more robust than the tests I can run locally). Changing cuda to openmp doesn't generate the same IR at present, so I'd like to be cautious about the changeover. Likewise, changing rocm -x hip to rocm -x openmp doesn't generate the same code for amdgpu, but in that case I know the different IR does still work.

I'd like to convert amdgpu to openmp as an intermediate step towards doing the same for nvptx. The workaround notes above suggest the rocm compiler is doing some strange things for openmp codegen. I'd like to ignore that in favour of getting the trunk compiler to generate the right code for openmp, at which point rocm will pick up those fixes and I won't have to write the patches twice. Then, with trunk generating the right code for amdgpu, it should be straightforward to switch nvptx over.

I can also prepare a prettier patch that doesn't include those workarounds (e.g. the omp_is_initial_device bodge), which will make the code in trunk look better, with the note that it'll waste a bunch of my time keeping the internal builds working.


Note to self: Does the allocate clause need to go on the declaration, according to the spec and/or the implementation? Ideally the header would only say 'there is a variable somewhere' and the implementation would specify that it comes from a particular allocator, but that means IR gen will use a generic address space, which may not work out


Note to self - don't use this bodge for nvptx

The declare target looks good, some stuff seems unrelated though.


This seems unrelated, can it go in before or after?


Why the enum? Can we move the _OPENMP stuff into a generic header?


Nit: I'd remove the comment.


D38968 is the reason. We should revert that patch as we have context selectors now.

  • Comment, mark a hack as amdgpu only
JonChesterfield marked an inline comment as not done.
  • move openmp macro to common header, reinstate building as hip

I think the problem goes:

  • openmp compiles target regions for the host
  • __builtin_amdgcn_s_sleep can't be called from host code (as it's not a thing)
  • therefore the compiler doesn't add the symbol to the host compilation

This works for device compilation, as the names and types match the functions that actually exist. It 'works' for host compilation, in so far as it emits calls to undefined functions with these names, and the compiler throws away the result of the host compilation.

Cuda has a similar quirk where calling a device intrinsic from a function that is compiled for the host fails. I'm not sure if nvptx/openmp will have the same behaviour.

I'm not sure what the pretty solution to this is. Somewhat inclined to compile target_impl as c++, instead of as openmp, as that bypasses it. I'd quite like to compile the whole library as c++, instead of engaging all the openmp machinery, but clang doesn't handle constructors in address spaces well enough.


This one was fun. I'd have liked to write '7', but clang sema does some very aggressive checking on the allocator clause that only accepts an enum with the right name and all the right fields filled out.

Yes though, we should have a header in common that exports the SHARED and EXTERN_SHARED macros, which will be helpful for nvptx. Provisionally named it omp_pteam_mem_alloc.


Happy to. All of the comment, or just the // static?


Ah, nice. It's a diy constant folding thing. Agreed that's no longer required.

  • Fix whitespace, prefix barrier global kmpc
JonChesterfield marked 2 inline comments as done. Jan 4 2021, 10:15 AM
  • minor cmake tweaks, fix a compiler warning on ignored attribute
  • default to building as hip as that's the current choice

This will be somewhat easier to test if I add the cmake for nvptx as well, since we have a working nvptx openmp compiler in trunk. I'll do that next.

JonChesterfield planned changes to this revision.Jan 22 2021, 7:37 AM

Most (possibly all except the amdgpu cmake) of this change is incorporated in D94745 or already committed to trunk. I'm going to leave this up as a reference for now but don't intend to land it.