Page MenuHomePhabricator

[OpenMP] Begin Adding Analysis Remarks for OpenMP Best Practises.
Needs ReviewPublic

Authored by jhuber6 on Jun 2 2020, 2:43 PM.

Details

Reviewers
jdoerfert
Summary

Adding analysis remarks to guide users to OpenMP best practices. The goal will to have a suite of analysis passes that can detect common mistakes in OpenMP code. This currently only checks if the user has specified a parallel region with a constant thread count.

Diff Detail

Unit TestsFailed

TimeTest
1,130 mslibomp.lock::omp_init_lock.c
Script: -- : 'RUN: at line 1'; /mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang/llvm-project/build/./bin/clang -fopenmp -pthread -fno-experimental-isel -I /mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang/llvm-project/openmp/runtime/test -I /mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang/llvm-project/build/projects/openmp/runtime/src -L /mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang/llvm-project/build/lib -I /mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang/llvm-project/openmp/runtime/test/ompt /mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang/llvm-project/openmp/runtime/test/lock/omp_init_lock.c -o /mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang/llvm-project/build/projects/openmp/runtime/test/lock/Output/omp_init_lock.c.tmp -lm -latomic && /mnt/disks/ssd0/agent/workspace/amd64_debian_testing_clang/llvm-project/build/projects/openmp/runtime/test/lock/Output/omp_init_lock.c.tmp

Event Timeline

jhuber6 created this revision.Jun 2 2020, 2:43 PM

Cool! Some minor comments below.

llvm/lib/Transforms/IPO/OpenMPOpt.cpp
423

Early exits are usually easier to read, thus:

if (!UV)
 return;

Maybe we should be more descriptive, telling people that fixed numbers are not future compatible. *And* we should not issue the remark if we find a worksharing loop with a constant trip count (that is a multiple of this constant). I guess the latter part can just be a TODO for now, maybe also a test case:

#pragma omp parallel for num_threads(8)
for (int i = 0; i < 8; ++i)
  body(i);

That reminds me that we should determine the loop trip count for worksharing loops as well. We could even use that value to feed it into "set_num_threads" to avoid waking more threads than needed.

llvm/test/Transforms/OpenMP/constant_thread_count_analysis.ll
120

Don't think we need this comment ;)

jhuber6 marked an inline comment as done.Jun 2 2020, 6:22 PM
jhuber6 added inline comments.
llvm/lib/Transforms/IPO/OpenMPOpt.cpp
423

Might be helpful, something like "Use of OpenMP parallel region with constant number of threads is not future compatibile. Consider a dynamic thread count instead."

And are there any general ways to extract the loop tripcount in LLVM? Depending on where you are in the optimization the loop could take many different forms as far as I'm aware.

jhuber6 updated this revision to Diff 268173.Jun 3 2020, 6:41 AM

Making minor adjustments.

Any recommendations on which best practices to target next? I'm not intimately familiar with OpenMP best practices so I'm not sure which things would qualify.

Bikeshedding: do we really want to do this here?
If we are not worried about the cases where the thread count was a variable originally,
but the variable eventually got folded into a constant,
this can be trivially done as a clang-tidy check, or clang diag.

Making minor adjustments.

Any recommendations on which best practices to target next? I'm not intimately familiar with OpenMP best practices so I'm not sure which things would qualify.

We can do one that "in theory" we will optimize soon:
Look for pointer arguments of the outlined parallel body function that are marked readonly (except the first two). For those, tell people that these should be passed by value via firstprivate.

You can also look for a __kmpc_barrier just before the end of a parallel region, e.g., in

#pragma omp parallel
{
   some code
   #pragma omp for
   for (...)
      ...
}

I also think we should improve the parallel annotation in a follow up step, that is, look at the body and avoid remarks if the usage of a constant thread count seems reasonable. There is also the case where the constant thread count might be reasonable but we can ask for further information, e.g.,

#pragma omp parallel for num_threads(4)
for (int i = 0; i < N; ++i)
  ...

We could tell the user that a constant thread count with an unknown loop trip count might not be what they want. Either make the thread count variable (or remove it), or provide information about the loop bound via operations like __builtin_assume(N % 4 == 0), or __builtin_assume(N = 4 * k).

Bikeshedding: do we really want to do this here?
If we are not worried about the cases where the thread count was a variable originally,
but the variable eventually got folded into a constant,
this can be trivially done as a clang-tidy check, or clang diag.

There are obviously pros and cons to the location to do this. Let me start by saying the thread count thing is just a simple starter, the idea is to allow opt-in analysis that try to provide way more insights into the program. Later I also want to modify the code and provide feedback at runtime or based on profiling data.
That said, there are other reasons to put it here, e.g., Fortran ;)
The variable vs constant thing could be checked in the frontend but it is questionable if the following cases are really different, I mean if we don't want constant prop to happen first or not,

void foo1() {
  #pragma omp parallel num_threads(8)
  {}
}
void foo2() {
  #pragma omp parallel num_threads(NumThreads)
  {]
}
void foo3(unsigned NT = 8) {
  #pragma omp parallel num_threads(NT)
  {]
}
llvm/lib/Transforms/IPO/OpenMPOpt.cpp
423

"Consider using a scalable thread count." + ", or remove the num_threads clause.

423
And are there any general ways to extract the loop tripcount in LLVM? Depending on where you are in the optimization the loop could take many different forms as far as I'm aware.

There is, and yes loops have different forms. The usual way is to ask ScalarEvolution for the trip count, however, that won't work for worksharing loops.

Though, worksharing loops are actually easy to predict. Look for the __kmpc_for_init (or similar) call and follow the lower and upper bound pointers. They should be written exactly once, and the difference between those two values is the trip count, at least for clang as we do normalize the loops. More general we also need to follow the stride argument and do some slightly more complex computation but it is still doable. You can even use ScalarEvolution to represent the bounds, stride, and difference. That way allows you to also create an expression representing the value into the code (if that is needed later), and it will simplify the expression if possible, e.g., constant fold it.

jhuber6 marked an inline comment as done.Jun 4 2020, 9:58 AM
jhuber6 added inline comments.
llvm/lib/Transforms/IPO/OpenMPOpt.cpp
423

The loop body where __kmpc_for_init is called is inside the callback function, which might also be separated by another funciton. I'm guessing the way to handle this would to start with the function where there's a __kmpc_init call and go backwards in the gall graph until you find the function that contains the __kmpc_fork call and check if there's a corresponding __kmpc_push_num_threads call. If there is and the found trip count isn't a multiple of the constant thread count, emit a message. Is that about right? I haven't seen any examples that show traversing the call graph in this way, just iterating over all the functions.

jhuber6 updated this revision to Diff 269370.Jun 8 2020, 3:43 PM

Small changes and adding support for checking use of shared for variables that should be pass-by-value.

jhuber6 marked 2 inline comments as done.Jun 8 2020, 3:45 PM
jhuber6 added inline comments.
llvm/lib/Transforms/IPO/OpenMPOpt.cpp
433

I originally had this function take a function, find the fork_call or fork_teams, and return the callback argument if it existed. I had to take the logic out because otherwise I didn't have an instruction to get a debugloc for.

This doesn't handle debug options where the compiler will wrap the parallel region in another function.

llvm/test/Transforms/OpenMP/shared_firstprivate_analysis.ll
126

Should I combine this into one giant file that just handles OpenMP Analysis stuff?

Thanks for the update! With the second analysis we can show why the frontend is not necessarily the perfect place for this, determining readonly happens naturally in the middle-end. (Not to mention with the analysis for sufficient conditions to know a transformation would be sound.).

I left a bunch of comments, nothing major though.

llvm/lib/Transforms/IPO/OpenMPOpt.cpp
163

I doubt these need to return a value.

Can we check if remarks are enabled and only run them in case they are?

350

Why an X-value (&&) and not just a reference (&)?

466

Nit: -;
Hoist RFI out of the loop.
Don't we have a helper to do something "foreach call"... we should.

I would not call it Callback but ParallelRegionFn or similar. Also use proper types not auto when it is not clear and helpful, e.g., for Callback.

I'd add the word "probably" or "very likely" in the message as it is only known to be the case if the argument also is not modified via a different pointer. (For example, only if it is noalias as well we know it is not otherwise modified and can do the transformation ourselves.)

llvm/test/Transforms/OpenMP/shared_firstprivate_analysis.ll
14

We need to reference the value somehow. Where does this point to now, I mean what is in line 7? Could we include the C source in a comment at the top? We probably want to issue 2 remarks, one for the value, with the value debug info, and one for the directive.

126

If there is no inherent benefit I would not merge them. Having more selective tests is usually better. You can prefix their names to make it easier to categorize them (instead of the current postfix), or put them into a subfolder.

jhuber6 marked 3 inline comments as done.Jun 8 2020, 7:49 PM
jhuber6 added inline comments.
llvm/lib/Transforms/IPO/OpenMPOpt.cpp
163

I think there's an option for checking in here but I haven't tried yet,
https://llvm.org/doxygen/classllvm_1_1DiagnosticInfoOptimizationBase.html

350

As far as I know, the Remark is being generated here as an r-value so the callback function can either copy it or move it, but can't take a reference to it since it has no location in memory. An r-value could bind to a const reference but the emitter needs to be mutated to build the remark.

ORE.emit([&]() {
  return RemarkCB(RemarkKind(DEBUG_TYPE, RemarkName, Inst));
});

I could change this to generate a RemarkKind inside the function and allow a generic reference, but it would pretty much just change whether or not the object is constructed here or when emit calls the lambda function. That's at least my understanding, I'm not an r-value l-value guru.

466

There's the "foreachUse" function but I wasn't sure if it applied here.

jhuber6 marked an inline comment as done.Jun 8 2020, 8:08 PM
jhuber6 added inline comments.
llvm/test/Transforms/OpenMP/shared_firstprivate_analysis.ll
14

I can go ahead and add the source. To print out the value it refers to I'd probably need to the get the value from the __kmpc_fork_call since the remark is generated from looking at the arguments to the .omp_outlined function which doesn't know anything about the actual value. Is it safe to assume that the arguments are in the same order here? Like the third argument to .omp_outlined is always the first vararg passed to the callback?

jhuber6 updated this revision to Diff 269935.Jun 10 2020, 12:09 PM

Small changes, added check for if remarks were enabled. Added calculation of loop trip count given an __kmpc_for_static_init call. Waiting on ICV tracking support to associate correct parallel regions before moving forward.

jdoerfert added inline comments.Jun 10 2020, 6:21 PM
llvm/lib/Transforms/IPO/OpenMPOpt.cpp
147

Nit: Documentation, also all the functions below please.

350

It's fine, I didn't check this deep, just from the diff I figured I ask.

405

Style: I'd get rid of the braces, makes it (for me) harder to argue if the return below is inside the lambda or not.

415

I think the convention is llvm::None, or something similar. Above, just return upper... so we can also remove the braces. While at it, remove the else (no else after return).

420

Somewhere in here we did:

const unsigned CallbackCalleeOperand = 2;

To give the magic number 2 a description. Maybe move it into class scope or duplicate it.

444

I think we should copy the pointer C by value here, [=] or [C].