Refactored code of dependence processing and added new inoutset dependence type.
Compiler can set dependence flag to 0x8 when call __kmpc_omp_task_with_deps.
All dependence flags library gets so far and corresponding dependence types: 
1 - IN, 2 - OUT, 3 - INOUT, 4 - MUTEXINOUTSET, 8 - INOUTSET.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
I've found a problem with this patch: cannot treat task dependence flags as an integer because it seems clang defines it as 1 byte. Current specification has 6 kinds of task dependences, so I think we'd better define the struct as 4 bytes just in case. I'll fix the patch for now.
Changed the size of task dependences flags from 8 bit to 32 bit, so that runtime does not get garbage data in unused bits of the structure, and can use flags as an integer where it is more convenient than looking at particular bits.
Also adjusted some tests to clear unused bits of the task dependences structure.
Can you add the inoutset case to the ompt code as well? omp-tools.h already defines ompt_dependence_type_inoutset.
| openmp/runtime/src/kmp_taskdeps.cpp | ||
|---|---|---|
| 344–349 | If I understand this code right, this has O(n^2) complexity for two sets of size n? Consider: for (int i=0; i<n; i++) {
#pragma omp task depend(in:A)
work(A,i);
}
for (int i=0; i<n; i++) {
#pragma omp task depend(inoutset:A)
work(A,i);
}All n tasks in the first loop would be predecessor for each of the tasks in the second loop. This will also result in n^2 releases/notifications. I'd suggest to model the dependencies like: for (int i=0; i<n; i++) {
#pragma omp task depend(in:A)
work(A,i);
}
#pragma omp taskwait depend(inout:A) nowait
for (int i=0; i<n; i++) {
#pragma omp task depend(inoutset:A)
work(A,i);
}I.e., add a dummy dependency node between the two sets, which has n predecessors and n successors. You probably understand the depnode code better, than I do, but I think you would only need to add some code in line 357, where last_set is moved to prev_set. | |
| openmp/runtime/src/kmp_taskdeps.cpp | ||
|---|---|---|
| 344–349 | This could be a separate research project. Because such change may cause performance regressions on some real codes, so it needs thorough investigation. I mean insertion of an auxiliary dummy node into dependency graph is definitely an overhead, which may or may not be overridden by reduction in number of edges in the graph. Or it can be made optional optimization under an external control, if there is a significant gain in some particular case. | |
| openmp/runtime/src/kmp_taskdeps.cpp | ||
|---|---|---|
| 344–349 | I don't suggest to insert the auxiliary node in the general case. Just for the case that you see a transition of in -> inoutset or vice versa. With current task dependencies, you always have 1 node notifying n nodes (out -> in) or n nodes notifying one node (in -> out). You can see this as an amortized O(1) operation per task in the graph. Here you introduce a code pattern, where n nodes each will need to notify m other nodes. This leads to an O(n) operation per task. I'm really worried about the complexity of this pattern. | |
| openmp/runtime/src/kmp_taskdeps.cpp | ||
|---|---|---|
| 344–349 | 
 No, I don't introduce the pattern, as it already worked for sets of tasks with in dependency following set of tasks with mutexinoutset dependency or vice versa. 
 I didn't like that inoutset was introduced as the clone of "in" in order to separate two (big) sets of mutually independent tasks, but this has already been added to specification, so we have to implement it. Of cause your suggestion can dramatically speed up dependency processing of some trivial case with two huge sets of tasks one wit in dependency another with inoutset; but it slightly changes the semantics of user's code, and actually can slowdown other cases. I would let users do such optimizations. Or compiler can do this once it sees big sets of tasks those don't have barrier-like "taskwait nowait". For user this is one line of code, for compiler this is dozen lines of code, for runtime library this is pretty big change which does not worth efforts to implement, I think. Especially given that such "optimization" will definitely slowdown some cases. E.g. small set of tasks with in dependency can be followed single task with inoutset in a loop. Why should we slowdown this case? Library has no idea of the structure of user's code. In general, I don't like the idea when runtime library tries to optimize user's code. Especially along with other changes in the same patch. | |
| openmp/runtime/src/kmp_taskdeps.cpp | ||
|---|---|---|
| 344–349 | 
 Ok, thanks for the clarification. I missed that mutexinoutset has the same issue. 
 I get your point. Ok, then. | |
| openmp/runtime/src/kmp_taskdeps.cpp | ||
|---|---|---|
| 418 | I hit this assertion, when compiling the tests with clang-11. Is this expected behavior? Does this patch break backwards compatibility, or should this assertion just not look at the higher bits, as they might be uninitialized? Whey I change dep_list[i].flag to (dep_list[i].flag&0xf), the assertion doesn't trigger. | |
| openmp/runtime/src/kmp_taskdeps.cpp | ||
|---|---|---|
| 418 | Thanks for pointing this out. I've reverted the patch in order to fix backwards compatibility problem. Will eliminate the dependence flag type size change in clang, and change the size in the library code instead. | |
Fixed backwards compatibility problem introduced by previous version of the patch.
That is restored the size of task dependence flag to 8 bits in clang, and instead changed the size of the flag in the library from 32 to 8 bits.
Fixed two tests so that they set all 8 bits of the flag.
If I understand this code right, this has O(n^2) complexity for two sets of size n?
Consider:
for (int i=0; i<n; i++) { #pragma omp task depend(in:A) work(A,i); } for (int i=0; i<n; i++) { #pragma omp task depend(inoutset:A) work(A,i); }All n tasks in the first loop would be predecessor for each of the tasks in the second loop. This will also result in n^2 releases/notifications.
I'd suggest to model the dependencies like:
for (int i=0; i<n; i++) { #pragma omp task depend(in:A) work(A,i); } #pragma omp taskwait depend(inout:A) nowait for (int i=0; i<n; i++) { #pragma omp task depend(inoutset:A) work(A,i); }I.e., add a dummy dependency node between the two sets, which has n predecessors and n successors. You probably understand the depnode code better, than I do, but I think you would only need to add some code in line 357, where last_set is moved to prev_set.
This dummy node would reduce linking and releasing complexity to O(n).