This is an archive of the discontinued LLVM Phabricator instance.

[OPENMP] New api ompx_get_device_num_units(int devid)
Needs ReviewPublic

Authored by gregrodgers on Oct 4 2022, 8:28 AM.

Details

Reviewers
jdoerfert
Summary

This returns the number of physical processors that can run a team on the specified device. For AMD, this is the number of CUs; for Nvidia, it is the number of SMs. For CPUs, this COULD be the number of sockets if multiple teams are supported. This API is needed for optimizing cross-team reductions, where we want to minimize the number of intermediate per-team reduction values.

If the device id is the initial device, 1 is returned.

Diff Detail

Event Timeline

gregrodgers created this revision.Oct 4 2022, 8:28 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 4 2022, 8:28 AM
gregrodgers requested review of this revision.Oct 4 2022, 8:28 AM
Herald added a project: Restricted Project. · View Herald Transcript
gregrodgers retitled this revision from [OPENMP] New api ompx_get_team_procs(devid) returns the number of physical processors that can run a team. For AMD, this is the number of CUs. for Nvidia, this is number of SMs. For CPUs, this COULD be number of sockets if multiple teams are... to [OPENMP] New api ompx_get_team_procs(devid) .Oct 4 2022, 8:35 AM
gregrodgers edited the summary of this revision. (Show Details)

The new interface to query the number of physical processors (or whatever it should be called) is fine, but the name is a little bit questionable. It implies how an OpenMP team is mapped to the low level model. I think it is possible that the execution model mapping is per-kernel. It doesn't sound great to have that implication in a general interface name.

"The new interface to query the number of physical processors (or whatever it should be called) is fine, but the name is a little bit questionable. It implies how an OpenMP team is mapped to the low level model. I think it is possible that the execution model mapping is per-kernel. It doesn't sound great to have that implication in a general interface name."

Thank you for the comment. It is the detail of the execution model as implemented by the plugin that we need to expose, namely the team. I cannot think of a model or implementation that does not map one or more teams to something physical (hence the word "proc"). This commit updates all the plugins that have devices, including generic-elf-64bit. I can imagine an implementation that maps a team to a host socket in a multi-socket system. This is why I did not put the word "target" in the name: the devid could be the host or initial device, which is why ompx_get_team_procs(omp_get_initial_device()) returns 1.

To minimize cross-team communication, one needs to know the minimum number of teams that activates all the hardware. See https://reviews.llvm.org/D135299 for how ompx_get_team_procs(devid) will be used to allocate storage for cross-team (xteam) communication.

There is an existing API, omp_get_num_procs(), which currently maps to the number of hardware threads in a block. It is callable in a target region, and I believe it typically returns the number of threads. But the number of threads is quite different from the number of physical locations where a team can run.

Currently ompx_get_team_procs(devid) is only callable from host.

The change to openmp/libomptarget/plugins/exports needs to be deleted. Upstream changed this file to include all functions beginning with __tgt_rtl.

gregrodgers retitled this revision from [OPENMP] New api ompx_get_team_procs(devid) to [OPENMP] New api ompx_get_team_procs(devid).
gregrodgers edited the summary of this revision. (Show Details)
  • Remove change to openmp/libomptarget/plugins/exports so this revision will merge with current trunk

As @tianshilei1992 noted, the naming is not great. Team can technically map to lots of things and soon they will. What about "thread groups"? Or even "sockets"?
Also, there is some duplication I pointed out below.

openmp/libomptarget/include/device.h
325

It's unclear why we need to store this in two places, the plugins and here. Other device data only lives in the plugins, this should too.

openmp/libomptarget/plugins/cuda/src/rtl.cpp
357

This should go into DeviceData, and in the new plugin interface it's different again.

  • Merge branch 'main' into arcpatch-D135162
  • move NumberOfTeamProcs into DeviceData per comment from jdoerfert

"Team can technically map to lots of things and soon they will. What about "thread groups"? Or even "sockets"?
Also, there is some duplication I pointed out below."

But teams are exactly what we are trying to address with ompx_get_team_procs(). How many physical things can a team "map to"/"run on"? This is needed because a user (or runtime) may want to limit the number of teams created to utilize all (or some subset) of the physical team processors. At one point I was thinking of utilizing the places API in OpenMP 5.2 spec chapter 18.3. If you imagine a place is like a device and a place partition is like a CU (or Nvidia SM), then omp_get_partition_num_places() would be close to what is needed. One of several problems with using the places API is that it is internal; I need an external API to help in setting the number of teams on the specified device id.

The other big problem with the places API is that it is written for thread management, to work with thread affinity. I think overloading this with a team (a group of threads) would get us in trouble fast.

I realize that the number of teams is typically application specific (or should be), and having this API may be a gun aimed at one's foot. But when cross-team coordination is necessary, it can be beneficial to limit the number of teams so as to utilize all the hardware while minimizing the coordination among teams.

I addressed both of your inline comments, one with a fix and the other with an explanation.

openmp/libomptarget/include/device.h
325

This is the value on the host DeviceTy that the getter and setter access. The getter backs the new external API ompx_get_team_procs(devid). The setter is called when the device is initialized and gets the value from the plugin, which now stores it in the DeviceData.

openmp/libomptarget/plugins/cuda/src/rtl.cpp
357

Done.

"Team can technically map to lots of things and soon they will. What about "thread groups"? Or even "sockets"?
Also, there is some duplication I pointed out below."

But teams are exactly what we are trying to address with ompx_get_team_procs(). How many physical things can a team "map to"/"run on"? This is needed because a user (or runtime) may want to limit the number of teams created to utilize all (or some subset) of the number physical team processors.

I understand all that, but as I said, CUs/SMs are not the only thing we can map teams onto. I could map two teams on each of them, or one team on two of them (potentially with some compiler fixup logic), etc.

Your use case is: get me the number of processor groups so I can determine the number of teams I want. The hardware doesn't provide "team processors" but something else, and with a certain mapping in mind it makes sense to query the number of these things. However, conflating this with "teams" at the user level is problematic, as "teams" can map in various ways, as mentioned.

At one point I was thinking of utilizing the places API in OpenMP 5.2 spec chapter 18.3. If you imagine a place is like a device and a place partition is like a CU (or Nvidia SM), then omp_get_partition_num_places() would be close. One of several problems with using the places API is that it is internal; I need an external API to help in setting the number of teams on the specified device id.

I can see external APIs being useful for queries, but setting the number should not be a thing. Getters reach the plugin managing the device and return the number of "hardware thing-ys".

The other big problem with places api is that it is written for thread management to work with thread affinity. I think overloading this with a team (group of threads) would get us in trouble fast.

Fair, I don't think this extension needs to unify such things.

openmp/libomptarget/include/device.h
325

But why do we need to store it in the plugin and here? No other information, e.g., the maximum number of threads, is stored twice. This just copies the value from the plugin once, which seems to provide little benefit.

Is this an LLVM-only API, or will other vendors support it as well? Is there an ompx_llvm_ namespace?

"Is this an LLVM-only API, or will other vendors support it as well? Is there an ompx_llvm_ namespace?"

It will be LLVM-only at first. Unsure if the others pick it up; there is a chance. llvm_ instead of ompx_ works for me; doing both is a mouthful, but sure.

I think I get your rationale now. A thread executes on a place's proc; in fact, many threads can execute on (map to) a place's proc. We need the same sort of "mapping" for a team or thread group. We did not call a place a thread_proc.
As the name is now, a team_proc is a place where a team can run. That can be a CU, an SM, a VE, a host socket, a host NUMA domain, a node, or something else. The code in this review covers all existing (in trunk) offload plugins that can execute a team.

So I am still searching for the right term. The term would fill in the blank in these two sentences: a place is to a thread as a _____ is to a team. One or more threads execute on a place's proc, whereas one or more teams execute on a _______ proc.
The current thread API is omp_get_place_num_procs(int place_num). The new ompx API for this review should be something like omp_get_NEWTERM_num_procs(int device).

I am open to suggestions.

See answer to inline comment.

openmp/libomptarget/include/device.h
325

"No other information is stored twice"

Not exactly. I tried to follow the mechanics of the omp_get_num_devices API as closely as possible. The plugin API __tgt_rtl_number_of_devices returns DeviceRTL.getNumbofDevices and initializes the PM->Devices vector in the plugin manager (outside the plugin) during initialization. Similarly, the new plugin API __tgt_rtl_number_of_team_procs returns the value stored in the plugin at initialization, which is also stored in the PM. I don't understand why there is a lock in omp_get_num_devices around access to PM, but I copied the lock in ompx_get_team_procs. I was just trying to be consistent with an existing API.

gregrodgers added a comment.EditedOct 17 2022, 9:51 PM

Is the name ompx_get_device_num_units(devid) acceptable?

I agree that ompx_llvm_ is a mouthful. But the OpenMP standard reserved the prefix ompx_ for implementation-defined functions.

"I agree that ompx_llvm_ is a mouthful. But the OpenMP standard reserved the prefix ompx_ for implementation-defined functions."

We are an implementation, hence we can use ompx_, that's the point.

  • change name from ompx_get_team_procs(devid) to ompx_get_device_num_units(devid). Internal names have also been changed.
gregrodgers retitled this revision from [OPENMP] New api ompx_get_team_procs(devid) to [OPENMP] New api ompx_get_device_num_units(int devid).Oct 19 2022, 1:17 PM

I use ompx_ in research on top of the LLVM OpenMP runtime. I have no plans to build an OpenMP runtime myself.