[Libomptarget] Improve device runtime implementation for globalized variables.

Authored by jhuber6 on Jun 21 2021, 1:00 PM.



Currently the runtime implementation of __kmpc_alloc_shared is extremely slow because it allocates memory for each thread individually. This patch adds a small buffer that the threads share, which greatly improves performance for builds where not all globalization could be optimized out. If the shared buffer is full, memory is allocated per-warp rather than per-thread.
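
To illustrate the idea, here is a minimal host-side C++ sketch of a fixed buffer that serves allocations by bumping an offset and falls back to the heap when full. It is not the code from this patch: `SharedStack`, `kSharedBufferSize`, `kAlignment`, `push`, `pop`, and `owns` are hypothetical names, the buffer size is arbitrary, and the per-warp coordination of the fallback path is left out so the sketch can run on a single host thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical sizes chosen for illustration; not taken from the patch.
constexpr std::size_t kSharedBufferSize = 256;
constexpr std::size_t kAlignment = 8;

// A small stack-like buffer with a heap fallback (illustrative only).
struct SharedStack {
  alignas(kAlignment) unsigned char buffer[kSharedBufferSize];
  std::size_t top = 0; // Offset of the first free byte in `buffer`.

  // Round a request up to the next multiple of kAlignment.
  static std::size_t alignUp(std::size_t bytes) {
    return (bytes + kAlignment - 1) & ~(kAlignment - 1);
  }

  // True if `ptr` points into the fixed buffer.
  bool owns(const void *ptr) const {
    auto addr = reinterpret_cast<std::uintptr_t>(ptr);
    auto base = reinterpret_cast<std::uintptr_t>(buffer);
    return addr >= base && addr < base + kSharedBufferSize;
  }

  // Fast path: bump the offset. Slow path: fall back to the heap.
  void *push(std::size_t bytes) {
    std::size_t size = alignUp(bytes);
    if (top + size <= kSharedBufferSize) {
      void *ptr = buffer + top;
      top += size;
      return ptr;
    }
    return std::malloc(size);
  }

  // Release in reverse order of allocation (stack discipline).
  void pop(void *ptr, std::size_t bytes) {
    if (owns(ptr)) {
      std::size_t size = alignUp(bytes);
      assert(top >= size && "pop without matching push");
      top -= size;
    } else {
      std::free(ptr);
    }
  }
};
```

In this sketch the fast path costs one comparison and one addition, while the fallback pays for a full heap allocation, which is why keeping most requests inside the buffer matters.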

Depends on D97680

Event Timeline

jhuber6 created this revision. Jun 21 2021, 1:00 PM
jhuber6 requested review of this revision. Jun 21 2021, 1:00 PM
Herald added a project: Restricted Project. Jun 21 2021, 1:00 PM
tianshilei1992 added inline comments. Jun 21 2021, 1:21 PM

IIRC, in the new deviceRTLs, we only have one stack where the first chunk, which is bigger than the rest, is for the main thread in non-SPMD mode. Why do we want to have two here?

jdoerfert added inline comments. Jun 21 2021, 1:41 PM

It was easier to write it like this from scratch. Either way works and there is no real difference, though this might actually be nicer.

This revision is now accepted and ready to land. Jun 21 2021, 10:13 PM
This revision was landed with ongoing or failed builds. Jun 22 2021, 8:53 AM
This revision was automatically updated to reflect the committed changes.