This is an archive of the discontinued LLVM Phabricator instance.

[OPENMP]Make __kmpc_push_tripcount thread safe.
ClosedPublic

Authored by ABataev on Jul 2 2019, 7:40 AM.

Details

Summary

__kmpc_push_tripcount function is not thread safe and may lead to data
race when the target regions are executed in parallel threads. The patch
makes loopTripCnt counter thread aware and stores the tripcount value
per thread in the map. Access to map is guarded by mutex to prevent
data race in the map itself.
Test is for NVPTX target because it does not work correctly on the
host. Seems to me, there is a problem in libomp with target regions in
the parallel threads.

Diff Detail

Repository
rL LLVM

Event Timeline

ABataev created this revision.Jul 2 2019, 7:40 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 2 2019, 7:40 AM
jdoerfert accepted this revision.Jul 2 2019, 12:50 PM

This looks generally good to me and I think we should get this fix in. I have one nit inlined, and two follow up questions below:

  1. Do we need to be interoperable with std::thread? I mean, would it be sufficient to ask for the OpenMP thread ID instead of the std::this_thread::get_id()?
  2. Shouldn't we move the whole loop trip count logic into the target offload call? Afaik, we currently emit sth. like:
__kmpc_set_upcoming_trip_count(N);
__kmpc_target(....);

Now if we would move the N as a parameter into the "target" call we would not need to store it anyway, or am I missing something here?

libomptarget/src/omptarget.cpp
738 ↗(On Diff #207559)

Nit: std::swap(ltc, I->second) and no braces.

This revision is now accepted and ready to land.Jul 2 2019, 12:50 PM

This looks generally good to me and I think we should get this fix in. I have one nit inlined, and two follow up questions below:

  1. Do we need to be interoperable with std::thread? I mean, would it be sufficient to ask for the OpenMP thread ID instead of the std::this_thread::get_id()?

It would be good to use OpenMP thread id, but, unfortunately, OpenMP thread id is not passed to the offloading functions (tgt_target etc.). That was the reason to use std::this_thread::get_id() instead of OpenMP thread id.
We can call __kmpc_global_thread_num(nullptr) directly instead. libomptarget already depends on libomp to support async offloading, so, maybe, we can use this libomp function instead of std::thread.

  1. Shouldn't we move the whole loop trip count logic into the target offload call? Afaik, we currently emit sth. like:
__kmpc_set_upcoming_trip_count(N);
__kmpc_target(....);

Now if we would move the N as a parameter into the "target" call we would not need to store it anyway, or am I missing something here?

I thought about it and I think it is a good idea. But it is a completely different problem. We can't change the interface of the existing functions (for compatibility), instead, we need to provide the whole bunch of new functions with some new params. I just want to fix the problem with the existing kmpc_push_tripcount function. Redesign requires some additional work.

ABataev updated this revision to Diff 207620.Jul 2 2019, 1:50 PM
  1. Used std::swap.
  2. Removed dependency on std::thread, used OpenMP thread id instead.
jdoerfert accepted this revision.Jul 2 2019, 2:18 PM

This looks generally good to me and I think we should get this fix in. I have one nit inlined, and two follow up questions below:

  1. Do we need to be interoperable with std::thread? I mean, would it be sufficient to ask for the OpenMP thread ID instead of the std::this_thread::get_id()?

It would be good to use OpenMP thread id, but, unfortunately, OpenMP thread id is not passed to the offloading functions (tgt_target etc.). That was the reason to use std::this_thread::get_id() instead of OpenMP thread id.
We can call __kmpc_global_thread_num(nullptr) directly instead. libomptarget already depends on libomp to support async offloading, so, maybe, we can use this libomp function instead of std::thread.

Looks better now.

  1. Shouldn't we move the whole loop trip count logic into the target offload call? Afaik, we currently emit sth. like:
__kmpc_set_upcoming_trip_count(N);
__kmpc_target(....);

Now if we would move the N as a parameter into the "target" call we would not need to store it anyway, or am I missing something here?

I thought about it and I think it is a good idea. But it is a completely different problem. We can't change the interface of the existing functions (for compatibility), instead, we need to provide the whole bunch of new functions with some new params. I just want to fix the problem with the existing kmpc_push_tripcount function. Redesign requires some additional work.

Fair point. Let's put that on the TODO list and talk about what else we want to include on the mailing list.

Good to commit.

grokos added a comment.Jul 2 2019, 2:56 PM

A long time ago we had identified the problem with the loop trip count and if I recall correctly the proposed solution was to store the trip count per OpenMP task. The fix would be implemented in libomp because libomptarget has no notion of tasks.

@AlexEichenberger Can you confirm whether or not this is the case? I may confuse this with something else...

A long time ago we had identified the problem with the loop trip count and if I recall correctly the proposed solution was to store the trip count per OpenMP task. The fix would be implemented in libomp because libomptarget has no notion of tasks.

@AlexEichenberger Can you confirm whether or not this is the case? I may confuse this with something else...

The patch does exactly what you said: stores tripcount per task, but in the libomptarget, not libomp. When it will be implemented in libomp, we could remove this code. But we need fully functional implementation now, not in the future.
Plus, libomptarget already communicates with libomp using kmpc_omp_taskwait,though incorrectly. It uses 0 as thread ID always, though it must use real OpenMP thread ID, not 0. I'm going to fix this in the next patch.

A long time ago we had identified the problem with the loop trip count and if I recall correctly the proposed solution was to store the trip count per OpenMP task. The fix would be implemented in libomp because libomptarget has no notion of tasks.

@AlexEichenberger Can you confirm whether or not this is the case? I may confuse this with something else...

The patch does exactly what you said: stores tripcount per task, but in the libomptarget, not libomp. When it will be implemented in libomp, we could remove this code. But we need fully functional implementation now, not in the future.

If this should really be fixed in libomp, then we should fix it there. At the very least, there should be a FIXME about removing the workaround once the libomp fix is in place. @AndreyChurbanov, any thoughts on this?

A long time ago we had identified the problem with the loop trip count and if I recall correctly the proposed solution was to store the trip count per OpenMP task. The fix would be implemented in libomp because libomptarget has no notion of tasks.

@AlexEichenberger Can you confirm whether or not this is the case? I may confuse this with something else...

The patch does exactly what you said: stores tripcount per task, but in the libomptarget, not libomp. When it will be implemented in libomp, we could remove this code. But we need fully functional implementation now, not in the future.

If this should really be fixed in libomp, then we should fix it there. At the very least, there should be a FIXME about removing the workaround once the libomp fix is in place. @AndreyChurbanov, any thoughts on this?

Hal, currently target tripcount has nothing to do with libomp. It is stored as part of the device in libomptarget. So, I think, to fix this problem in libomptarget is fine.
To put it into target task allocated memory is completely different problem, which requires some redesign and full support of target tasks. Plus, just like I said, it is not true that libomptarget is thread agnostic. Actually, it is not, since it uses __kmpc_omp_taskwait(NULL, 0) for dependendent target regions. Moreover, it is not thread safe too, because uses not real thread ID, but hardcoded 0 thread ID. It may lead to problems even in a single-threaded environment if the caller thread is not a master thread.

A long time ago we had identified the problem with the loop trip count and if I recall correctly the proposed solution was to store the trip count per OpenMP task. The fix would be implemented in libomp because libomptarget has no notion of tasks.

@AlexEichenberger Can you confirm whether or not this is the case? I may confuse this with something else...

The patch does exactly what you said: stores tripcount per task, but in the libomptarget, not libomp. When it will be implemented in libomp, we could remove this code. But we need fully functional implementation now, not in the future.

If this should really be fixed in libomp, then we should fix it there. At the very least, there should be a FIXME about removing the workaround once the libomp fix is in place. @AndreyChurbanov, any thoughts on this?

Hal, currently target tripcount has nothing to do with libomp. It is stored as part of the device in libomptarget. So, I think, to fix this problem in libomptarget is fine.
To put it into target task allocated memory is completely different problem, which requires some redesign and full support of target tasks.

I think that it makes sense to note this in the code. Can you please add a comment along the lines of (assuming that I'm understanding you correctly):

// NOTE: Once libomp gains full target-task support, this state should be moved into the target task in libomp.

Plus, just like I said, it is not true that libomptarget is thread agnostic. Actually, it is not, since it uses __kmpc_omp_taskwait(NULL, 0) for dependendent target regions. Moreover, it is not thread safe too, because uses not real thread ID, but hardcoded 0 thread ID. It may lead to problems even in a single-threaded environment if the caller thread is not a master thread.

This is the issue to which "I'm going to fix this in the next patch." refers, correct?

A long time ago we had identified the problem with the loop trip count and if I recall correctly the proposed solution was to store the trip count per OpenMP task. The fix would be implemented in libomp because libomptarget has no notion of tasks.

@AlexEichenberger Can you confirm whether or not this is the case? I may confuse this with something else...

The patch does exactly what you said: stores tripcount per task, but in the libomptarget, not libomp. When it will be implemented in libomp, we could remove this code. But we need fully functional implementation now, not in the future.

If this should really be fixed in libomp, then we should fix it there. At the very least, there should be a FIXME about removing the workaround once the libomp fix is in place. @AndreyChurbanov, any thoughts on this?

Hal, currently target tripcount has nothing to do with libomp. It is stored as part of the device in libomptarget. So, I think, to fix this problem in libomptarget is fine.
To put it into target task allocated memory is completely different problem, which requires some redesign and full support of target tasks.

I think that it makes sense to note this in the code. Can you please add a comment along the lines of (assuming that I'm understanding you correctly):

// NOTE: Once libomp gains full target-task support, this state should be moved into the target task in libomp.

Ok.

Plus, just like I said, it is not true that libomptarget is thread agnostic. Actually, it is not, since it uses __kmpc_omp_taskwait(NULL, 0) for dependendent target regions. Moreover, it is not thread safe too, because uses not real thread ID, but hardcoded 0 thread ID. It may lead to problems even in a single-threaded environment if the caller thread is not a master thread.

This is the issue to which "I'm going to fix this in the next patch." refers, correct?

Yes.

A long time ago we had identified the problem with the loop trip count and if I recall correctly the proposed solution was to store the trip count per OpenMP task. The fix would be implemented in libomp because libomptarget has no notion of tasks.

@AlexEichenberger Can you confirm whether or not this is the case? I may confuse this with something else...

The patch does exactly what you said: stores tripcount per task, but in the libomptarget, not libomp. When it will be implemented in libomp, we could remove this code. But we need fully functional implementation now, not in the future.

If this should really be fixed in libomp, then we should fix it there. At the very least, there should be a FIXME about removing the workaround once the libomp fix is in place. @AndreyChurbanov, any thoughts on this?

Hal, currently target tripcount has nothing to do with libomp. It is stored as part of the device in libomptarget. So, I think, to fix this problem in libomptarget is fine.
To put it into target task allocated memory is completely different problem, which requires some redesign and full support of target tasks.

I think that it makes sense to note this in the code. Can you please add a comment along the lines of (assuming that I'm understanding you correctly):

// NOTE: Once libomp gains full target-task support, this state should be moved into the target task in libomp.

Ok.

Thanks!

Plus, just like I said, it is not true that libomptarget is thread agnostic. Actually, it is not, since it uses __kmpc_omp_taskwait(NULL, 0) for dependendent target regions. Moreover, it is not thread safe too, because uses not real thread ID, but hardcoded 0 thread ID. It may lead to problems even in a single-threaded environment if the caller thread is not a master thread.

This is the issue to which "I'm going to fix this in the next patch." refers, correct?

Yes.

ABataev updated this revision to Diff 207778.Jul 3 2019, 6:33 AM

Added note.

Added note.

The patch description says that you add a trip count "per thread", but the code is adding a trip count per device. Could this still cause a problem if multiple threads have target regions for the same device?

Added note.

The patch description says that you add a trip count "per thread", but the code is adding a trip count per device. Could this still cause a problem if multiple threads have target regions for the same device?

It's one map per device, each map contains one trip count per OpenMP thread.

Added note.

The patch description says that you add a trip count "per thread", but the code is adding a trip count per device. Could this still cause a problem if multiple threads have target regions for the same device?

It's one map per device, each map contains one trip count per OpenMP thread.

Ah, indeed. Nevermind.

Added note.

The patch description says that you add a trip count "per thread", but the code is adding a trip count per device. Could this still cause a problem if multiple threads have target regions for the same device?

No. We already have tripcount per device: each device stores its unique tripcount value, but only one. I extended the existing solution to have the tripcount not only per device, but also per host thread.

hfinkel added a comment.EditedJul 3 2019, 2:53 PM

Added note.

The patch description says that you add a trip count "per thread", but the code is adding a trip count per device. Could this still cause a problem if multiple threads have target regions for the same device?

No. We already have tripcount per device: each device stores its unique tripcount value, but only one. I extended the existing solution to have the tripcount not only per device, but also per host thread.

Would there be any advantage to using __kmpc_threadprivate_register and __kmpc_threadprivate (or similar) instead of storing a map with thread ids?

Added note.

The patch description says that you add a trip count "per thread", but the code is adding a trip count per device. Could this still cause a problem if multiple threads have target regions for the same device?

No. We already have tripcount per device: each device stores its unique tripcount value, but only one. I extended the existing solution to have the tripcount not only per device, but also per host thread.

Would there be any advantage to using __kmpc_threadprivate_register and __kmpc_threadprivate (or similar) instead of storing a map with thread ids?

Hard to tell, but definitely ee increase dependency between libomptarget and libomp. and, seems to me, threadprivate interface also uses some kind of hash table. We can use unordered_map instead.

Added note.

The patch description says that you add a trip count "per thread", but the code is adding a trip count per device. Could this still cause a problem if multiple threads have target regions for the same device?

No. We already have tripcount per device: each device stores its unique tripcount value, but only one. I extended the existing solution to have the tripcount not only per device, but also per host thread.

Would there be any advantage to using __kmpc_threadprivate_register and __kmpc_threadprivate (or similar) instead of storing a map with thread ids?

Hard to tell, but definitely ee increase dependency between libomptarget and libomp. and, seems to me, threadprivate interface also uses some kind of hash table. We can use unordered_map instead.

Given that we expect this to be temporary, I'm okay with it either way. unordered_map is probably a little bit better than std::map. I expect the number of entries in the map to be small because __kmpc_global_thread_num returns OpenMP thread ids, not system TIDs, right?

Added note.

The patch description says that you add a trip count "per thread", but the code is adding a trip count per device. Could this still cause a problem if multiple threads have target regions for the same device?

No. We already have tripcount per device: each device stores its unique tripcount value, but only one. I extended the existing solution to have the tripcount not only per device, but also per host thread.

Would there be any advantage to using __kmpc_threadprivate_register and __kmpc_threadprivate (or similar) instead of storing a map with thread ids?

Hard to tell, but definitely ee increase dependency between libomptarget and libomp. and, seems to me, threadprivate interface also uses some kind of hash table. We can use unordered_map instead.

Given that we expect this to be temporary, I'm okay with it either way. unordered_map is probably a little bit better than std::map. I expect the number of entries in the map to be small because __kmpc_global_thread_num returns OpenMP thread ids, not system TIDs, right?

Yes.

Added note.

The patch description says that you add a trip count "per thread", but the code is adding a trip count per device. Could this still cause a problem if multiple threads have target regions for the same device?

No. We already have tripcount per device: each device stores its unique tripcount value, but only one. I extended the existing solution to have the tripcount not only per device, but also per host thread.

Would there be any advantage to using __kmpc_threadprivate_register and __kmpc_threadprivate (or similar) instead of storing a map with thread ids?

Hard to tell, but definitely ee increase dependency between libomptarget and libomp. and, seems to me, threadprivate interface also uses some kind of hash table. We can use unordered_map instead.

Given that we expect this to be temporary, I'm okay with it either way. unordered_map is probably a little bit better than std::map. I expect the number of entries in the map to be small because __kmpc_global_thread_num returns OpenMP thread ids, not system TIDs, right?

Yes.

SGTM.

hfinkel accepted this revision.Jul 5 2019, 6:23 PM

Added note.

The patch description says that you add a trip count "per thread", but the code is adding a trip count per device. Could this still cause a problem if multiple threads have target regions for the same device?

No. We already have tripcount per device: each device stores its unique tripcount value, but only one. I extended the existing solution to have the tripcount not only per device, but also per host thread.

Would there be any advantage to using __kmpc_threadprivate_register and __kmpc_threadprivate (or similar) instead of storing a map with thread ids?

Hard to tell, but definitely ee increase dependency between libomptarget and libomp. and, seems to me, threadprivate interface also uses some kind of hash table. We can use unordered_map instead.

Given that we expect this to be temporary, I'm okay with it either way. unordered_map is probably a little bit better than std::map. I expect the number of entries in the map to be small because __kmpc_global_thread_num returns OpenMP thread ids, not system TIDs, right?

Yes.

SGTM.

A note on naming, but this also LGTM.

I chatted offline with Johannes, and I agree with him, that longer term we should have a new interface that doesn't require this "push" precall. Regardless, I think that the comment is fine, in that, so long as we have to support this interface (which we might need for backward compatibility even after we add a new interface), it's fine to note where a cleaner location for that state data might be.

libomptarget/src/device.h
101 ↗(On Diff #207778)

While we're changing this, we should also update it to following our naming conventions:

loopTripCnt -> LoopTripCnt
ABataev marked an inline comment as done.Jul 5 2019, 6:38 PM
ABataev added inline comments.
libomptarget/src/device.h
101 ↗(On Diff #207778)

I thin the renaming better to put in another patch, but if you're ok with this, I'll rename it here

hfinkel added inline comments.Jul 5 2019, 6:44 PM
libomptarget/src/device.h
101 ↗(On Diff #207778)

My thinking is: Because we need to change every line of code which references it anyway, changing the name along with changing the type makes sense. To me, I don't think it matters whether you split that into one patch or two. If you think it's cleaner to changing the name in a separate patch from changing the type, that's fine too.

This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptJul 8 2019, 8:30 AM