This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP] Code generation for teams - kernel launching
ClosedPublic

Authored by sfantao on Feb 8 2016, 10:03 PM.

Details

Summary

This patch implements the launching of a target region in the presence of a nested teams region, i.e calls tgt_target_teams with the required arguments gathered from the enclosed teams directive.

The actual codegen of the region enclosed by the teams construct will be contributed in a separate patch.

Diff Detail

Event Timeline

sfantao updated this revision to Diff 47295.Feb 8 2016, 10:03 PM
sfantao retitled this revision from to [OpenMP] Code generation for teams - kernel launching.
sfantao updated this object.
ABataev added inline comments.Feb 16 2016, 8:27 PM
lib/CodeGen/CGOpenMPRuntime.cpp
4065–4072

It is better to use OMPCapturedExprDecl for this, just like it is done for schedule clause

4076–4111

Please, do it in separate functions

4091

Return type must be Int32Ty, I think

4108

Also Int32Ty

4447–4460

Again, all this must be done in separate functions

sfantao updated this revision to Diff 48229.Feb 17 2016, 12:08 PM
sfantao marked 4 inline comments as done.
sfantao updated this object.

Separate emission of num_teams and thread_limit into functions.

Hi Alexey,

Thanks for the review!

lib/CodeGen/CGOpenMPRuntime.cpp
4065–4072

I don't think that would completel solve the problem in this case. In my understanding the problem I have here is slightly differetn than the one OMPCapturedExprDecl attempts to solve:

I have a clause (num_teams/thread_limit) that is part of an enclosed directive (teams) that I need to emit in the outer scope (target). If I create a OMPCapturedExprDecl, that would have to go with some dummy clause for the target so that the initializer is emitted at the target lexical scope, and that emission would only work because in most directives the captures are local variables of the enclosing scope, and emission on the locals takes precedence over declaration that "refer to enclosing capture". However, target directive is special in the sense that it also captures global variables. So if I use OMPCapturedExprDecl on a expression that refers to globals that will cause a crash during the emission of the initializer because the capture of the target directive was not created yet. The patch has regression tests exactly to test this subtle difference in the target directive.

I am not saying that there are no other ways of doing this, but this approach seemed to me as the least disruptive as it is self-contained in the target codegen.

Let me know if you disagree.

4076–4111

I centralized the emission of num_teams and thread_limit in a function.

4447–4460

I am now doing the emission of num_teams and thread_limit in a separate function.

ABataev added inline comments.Feb 18 2016, 5:21 AM
lib/CodeGen/CGOpenMPRuntime.cpp
4065–4164

I don't understand why global var is not captured in target region. If it is not implemented yet, it must be implemented. If it is not captured, it must be captured in Sema. We should not introduce some function/other objects to find a workaround for 'not implemented' features.

Hi Alexey,

lib/CodeGen/CGOpenMPRuntime.cpp
4065–4164

Sorry, I was not clear in my comment. It is not that globals are not captured in target regions - they are, we already have Sema doing that.

My point is that exactly because we capture globals in target regions the magic that OMPCapturedExprDecl introduces does not work for that specific case. So, if we have something like:

int Gbl;

foo() {
  #pragma omp target
  #pragma omp teams num_teams(Gbl)
  {}
}

when the DeclRefExpr for Gbl used in num_teams is emitted in the scope that encloses '#pragma omp target', it will crash because Gbl is not a local and is marked as refer to enclosing capture.

If I got it right, a solution based on OMPCapturedExprDecl basically makes local declarations whose initializers are the expression we are interested in. In the cases that OMPCapturedExprDecl is currently employed we don't have globals being captured and that is why it works fine.

It is likely I am missing something here. Let me know if you need me to provide more details.

Thanks!

ABataev added inline comments.Feb 18 2016, 11:22 AM
lib/CodeGen/CGOpenMPRuntime.cpp
4065–4164

It should not crash, because if it is captured, we must use captured version of this variable, passed in arguments to outlined function

Hi Alexey

lib/CodeGen/CGOpenMPRuntime.cpp
4065–4164

I am afraid I may not be understanding what you want me to do. Going back to my example:

int Gbl;

foo() {
  // a) I need to emit num_teams(Gbl) here. DeclRefExpr(Gbl) emission won't work because it is marked "refer to enclosing capture". 
  #pragma omp target
  // b) If I emit it here, that's fine because I already have the arguments of the outlined function, but that is not what I need.
  #pragma omp teams num_teams(Gbl)
  {}
}

Can you please elaborate on how OMPCapturedExprDecl would help me implement a). Sorry for the trouble.

Thanks!

ABataev added inline comments.Feb 19 2016, 8:06 PM
lib/CodeGen/CGOpenMPRuntime.cpp
4065–4164

Ok, why you don't want to emit it in b), but in a)?

sfantao added inline comments.Feb 20 2016, 9:52 AM
lib/CodeGen/CGOpenMPRuntime.cpp
4065–4164

The reason is that the runtime library requires the number of teams and thread limit to be passed. So, if we have a target region with an enclosed teams region, we have to use tgt_target_teams instead of tgt_target. tgt_target_teams takes thread_limit and num_teams as arguments. Therefore, we need to get that information from the teams directive given that that information is captured in its clauses.

Thanks!

Just wanted to add that tgt_target_teams needs the values for num_teams and thread_limit because, for some accelerators, it is necessary to know those values in advance, before teams gets actually executed. For instance, on Nvidia GPUs we launch one CUDA block for each team. This can only be done at kernel launch time, which is performed in the implementation of tgt_target_teams.

sfantao updated this revision to Diff 48839.Feb 23 2016, 11:13 AM
sfantao edited edge metadata.

Rebase.

ABataev added inline comments.Feb 25 2016, 7:38 PM
lib/CodeGen/CGOpenMPRuntime.cpp
4073

I still don't like the generation of some functions, that, generally speaking, are not required.
Could you try to add a new 'class CGOpenMPInlinedRegionInfo' like class, that will be able to handle not captured variables in expressions?

sfantao updated this revision to Diff 49331.Feb 28 2016, 4:39 PM
sfantao marked an inline comment as done.

Emit num teams and thread limit using the inlined directives machinery.

lib/CodeGen/CGOpenMPRuntime.cpp
4073

Ok. In the the new diff I am using the logic for emission of inlined regions for num teams and thread limit as well. I still had to add extra logic in CodeGenFunction to insert extra entries in the local declaration cache, given that, as I discussed above, target regions make captured global variables local. Hope this is aligned with what you have in mind.

Thanks!

ABataev added inline comments.Mar 2 2016, 3:26 AM
lib/CodeGen/CGOpenMPRuntime.cpp
327–350

Do not modify this one, add a new one like this:

static void EmptyCodeGen(CodeGenFunction &) {
  llvm_unreachable("No codegen for expressions");
}
/// \brief API for generation of expressions captured in OpenMP region in outer
/// scope.
class CGOpenMPOuterExprInfo : public CGOpenMPInlinedRegionInfo {
public:
  CGOpenMPOuterExprInfo(CodeGenFunction &CGF)
      : CGOpenMPInlinedRegionInfo(CGF.CapturedStmtInfo, CodeGen, OMPD_unknown,
                                  /*HasCancel=*/false),
        CGF(CGF) {}

  /// \brief Lookup the captured field decl for a variable.
  const FieldDecl *lookup(const VarDecl *VD) const override {
    if (auto *FD = CGOpenMPInlinedRegionInfo::lookup(VD))
      return FD;
    if (!VD->isLocalVarDeclOrParm() && Mapped.count(VD) == 0) {
      auto It =
          PrivateGlobals.insert(new CodeGenFunction::OMPPrivateScope(CGF));
      DeclRefExpr DRE(const_cast<VarDecl *>(VD),
                      /*RefersToEnclosingVariableOrCapture=*/false,
                      VD->getType().getNonReferenceType(), VK_LValue,
                      SourceLocation());
      It.first->addPrivate(
          VD, [&]() -> Address { return CGF.EmitLValue(&DRE).getAddress(); });
      (void)It.first->Privatize();
      Mapped.insert(VD);
    }
    return nullptr;
  }

  /// \brief Emit the captured statement body.
  void EmitBody(CodeGenFunction &CGF, const Stmt *S) override {
    llvm_unreachable("No body for expressions");
  }

  /// \brief Get a variable or parameter for storing global thread id
  /// inside OpenMP construct.
  const VarDecl *getThreadIDVariable() const override {
    llvm_unreachable("No thread id for expressions");
  }

  /// \brief Get the name of the capture helper.
  StringRef getHelperName() const override {
    llvm_unreachable("No helper name for expressions");
  }

  static bool classof(const CGCapturedStmtInfo *Info) {
    llvm_unreachable("No helper name for expressions");
  }

  virtual ~CGOpenMPOuterExprInfo() {
    for (auto *Scope : PrivateGlobals)
      delete Scope;
    PrivateGlobals.clear();
  }

private:
  CodeGenFunction &CGF;
  /// Private scopes for each captured global variables.
  llvm::SmallPtrSet<CodeGenFunction::OMPPrivateScope *, 4> PrivateGlobals;
  SmallSet<VarDecl *> Mapped;
};
sfantao updated this revision to Diff 49653.Mar 2 2016, 11:18 AM

Use new innermost scope API for the emission of the num_teams and thread_limit expressions.

lib/CodeGen/CGOpenMPRuntime.cpp
327–350

Ok, I adapted the code you pasted above and I am now creating a new inline region API. I am naming it CGOpenMPInnerExprInfo given that it relates to the emission of expression defined in the inner scope. Also, I am doing the privatization in the constructor given that the globals have to be local already by the time the expression is emitted.

ABataev added inline comments.Mar 2 2016, 10:10 PM
lib/CodeGen/CGOpenMPRuntime.cpp
49–51

Do we really need this one? I don't think it will be used in codegen for directives, so do not add it.

316–318

I think it will be enough just to return 'false' here always, it should not be used in any casting operations ever

4438–4440

What if ThreadLimit is 'nullptr'? And why it cannot be 'nullptr' if NumTeams is not 'nullptr'?

lib/CodeGen/CGStmtOpenMP.cpp
2712

Use 'OMPLexicalScope Scope(*this, S);' instead.

sfantao updated this revision to Diff 49712.Mar 2 2016, 10:44 PM
sfantao marked 4 inline comments as done.
  • Remove InnerInlineRegion Kind. Improve comments and other two minor edits.

Hi Alexey,

Thanks for the review!

lib/CodeGen/CGOpenMPRuntime.cpp
49–51

Ok, I removed it. I was just following what was being done for the other APIs.

316–318

Ok, done.

4438–4440

Both values should be defined if there is a nested teams directive. If there are no num_teams or thread_limit clauses (but we have a team directive), those values will be defined with a int32 constant zero, which is the default value for the runtime library. So, no matter the clauses, if there is a teams directive both values will be defined. So, it is safe to assume that both values will either be defined or both null.

I added a comment to clarify that.

lib/CodeGen/CGStmtOpenMP.cpp
2712

Done!

ABataev accepted this revision.Mar 2 2016, 11:25 PM
ABataev edited edge metadata.

LG

This revision is now accepted and ready to land.Mar 2 2016, 11:25 PM
sfantao closed this revision.Mar 3 2016, 8:25 AM