This is an archive of the discontinued LLVM Phabricator instance.

[CUDA][HIP][OpenMP] Emit deferred diagnostics by a post-parsing AST travese
ClosedPublic

Authored by yaxunl on Nov 13 2019, 4:35 AM.

Details

Summary

This patch removes the explicit call graph for CUDA/HIP/OpenMP deferred diagnostics generated during parsing
since it is error prone due to incomplete information about function declarations during parsing. In stead,
this patch does a post-parsing AST traverse and emits deferred diagnostics based on the use graph implicitly
generated during the traverse.

Diff Detail

Event Timeline

yaxunl created this revision.Nov 13 2019, 4:35 AM
tra added a subscriber: rnk.Nov 13 2019, 10:28 AM

Calling @rnk for Windows know-how.

clang/test/SemaCUDA/deleting-dtor.cu
45–46 ↗(On Diff #229058)

Nit: I think it should be requires deleting dtor to be emitted or requires that deleting dtor is emitted

rjmccall added a subscriber: rsmith.

This seems like the wrong approach; @rsmith should take a look.

yaxunl removed 1 blocking reviewer(s): rsmith.Nov 13 2019, 11:22 AM
yaxunl added 1 blocking reviewer(s): rsmith.Nov 13 2019, 6:37 PM

sorry I think I misunderstood the meaning of "blocking" so I put it back.

rnk added a comment.Nov 14 2019, 9:51 AM

Are we sure using both Itanium and MS C++ ABIs at the same time is really the best way forward here? What are the constraints on CUDA that require the Itanium ABI? I'm sure there are real reasons you can't just use the MS ABI as is, but I'm curious what they are. Was there some RFC or design showing that this is the right way forward?

I wonder if it would be more productive to add new, more expansive attributes, similar to __attribute__((ms_struct)), that tag class or function decls as MS or Itanium C++ ABI. CUDA could then leverage this as needed, and it would be much easier to construct test cases for MS/Itanium interop. This is an expansion in scope, but it seems like it could be generally useful, and if we're already going to enter the crazy world of multiple C++ ABIs in a single TU, we might as well bite the bullet and do it in a way that isn't specific to CUDA.

In D70172#1745998, @rnk wrote:

Are we sure using both Itanium and MS C++ ABIs at the same time is really the best way forward here? What are the constraints on CUDA that require the Itanium ABI? I'm sure there are real reasons you can't just use the MS ABI as is, but I'm curious what they are. Was there some RFC or design showing that this is the right way forward?

I wonder if it would be more productive to add new, more expansive attributes, similar to __attribute__((ms_struct)), that tag class or function decls as MS or Itanium C++ ABI. CUDA could then leverage this as needed, and it would be much easier to construct test cases for MS/Itanium interop. This is an expansion in scope, but it seems like it could be generally useful, and if we're already going to enter the crazy world of multiple C++ ABIs in a single TU, we might as well bite the bullet and do it in a way that isn't specific to CUDA.

We are not using Itanium ABI when we do host compilation of CUDA/HIP on windows. During the host compilation on windows only MS C++ ABI is used.

This issue is not due to mixing MS ABI with Itanium ABI.

This issue arises from the delayed diagnostics for CUDA/HIP. Basically we do not want to emit certain diagnostics (e.g. error in inline assembly code) in __host__ __device__ functions to avoid clutter. We only want to emit such diagnostics once we are certain these functions will be emitted in IR.

To implement this, clang maintains a call graph. For each reference to a function, clang checks the current context. If it is evaluating context and it is a function, clang assumes the referenced function is callee and its context is the caller. Clang checks if the caller is known to be emitted (if it has body and external linkage). If not, clang adds this caller/callee pair to the call graph. If the caller is known to be emitted, clang will check if the callee is known to be emitted. If so, do nothing. If the callee is not known to be emitted, clang will eliminate it and all its callee from the call graph, and emits the delayed diagnostics associated with them.

You can see a caller is added to the call graph only if it is not known to be emitted. Therefore clang has an assert that if a callee is known to be emitted, it should not be in the call graph.

On windows, when vtable is known to be emitted for a class, clang does a body check for dtor of the class. It makes the dtor as the context, then checks the dtor. I think it is to emulate the situation that a deleting dtor is calling a normal dtor. This happens if the dtor is not defined since otherwise the dtor has already been checked. Since dtor is not defined yet, it is not known to be emitted and put into call graph. Later on, if the dtor is defined, it will be checked again. This time it is known to be emitted, then clang finds that it is in the call graph, then the assert fails.

So the issue is that clang incorrectly assume the dtor is not known to be emitted in the first check and put it in the call graph. To fix that, a map is added to Sema to tell clang that it is checking a deleting dtor which is supposed to be emitted even if it is not defined.

yaxunl updated this revision to Diff 232197.Dec 4 2019, 1:33 PM
yaxunl marked 2 inline comments as done.

remove unnecessary states added to Sema.

clang/test/SemaCUDA/deleting-dtor.cu
45–46 ↗(On Diff #229058)

fixed

rnk added a comment.Dec 4 2019, 3:24 PM

We are not using Itanium ABI when we do host compilation of CUDA/HIP on windows. During the host compilation on windows only MS C++ ABI is used.

This issue is not due to mixing MS ABI with Itanium ABI.
...

I think I might have understood all that.

Really, the problem is that, in C++, there are many kinds of special members created by the compiler that are not modeled in the AST. Deleting destructors are a good example. If we consistently used GlobalDecl throughout Sema, then we would be able to separate marking the deleting destructor referenced from marking the base destructor referenced, and this code would be easier to understand.

However, given the way things stand, your new approach seems like a reasonable way of detecting the case of referencing the deleting dtor here. So from my perspective, this is fine. @rjmccall, assuming that Richard doesn't have time to give any input, do you still think this needs his review?

Richard is definitely our main expert in the implicit synthesis of special members. It seems to me that if we need the destructor declaration at some point, we should be forcing it to exist at that point.

yaxunl added a comment.Jan 7 2020, 1:52 PM

Richard is definitely our main expert in the implicit synthesis of special members. It seems to me that if we need the destructor declaration at some point, we should be forcing it to exist at that point.

In AST there are no separate decls for deleting dtors and complete object dtors. In AST there are only complete object dtors. In codegen when clang emits the definition of a deleting dtor, clang uses GlobalDecl with Dtor_Deleting. However AST does not have that.

Since a deleting dtor is supposed to call a complete object dtor, clang needs to check the complete object dtor in the context of the deleting dtor. Since deleting dtor is synthesized in codegen and does not have a body, clang manually pushed the decl of the complete object dtor as context and checks the same complete object dtor.

One may consider using GlobalDecl to differentiate complete object dtor and deleting dtor in AST. However that requires to use GlobalDecl to replace Decl in many places in Sema, which seems to be an overkill.

Fortunately, we could identify the deleting dtor by context without using GlobalDecl.

There are two cases :

  1. There is no definition of complete object dtor,

When clang checks a dtor, if the caller is itself and the caller has no definition. This can only happen when clang checks the deleting dtor. Clang should just assumes the dtor is emitted. Since the dtor has no definition, there is no deferred diagnostics emitted. Clang just add a call graph branch dtor->dtor to the call graph. There is no deferred diagnostics happening with the dtor since the deleting dtor only calls complete object dtor and deallocating functions which are not supposed to cause diagnostics.

Later, if the dtor is called in other functions and checked, since the caller is not itself, it is treated as a normal function, i.e., whether it is emitted is determined by whether it has definition. Since the deleting dtor does not have extra deferred diagnostics compared with complete object dtor, there is no need to differentiate whether the callee is deleting dtor or complet object dtor.

If the complete object dtor is defined, its callees and deferred diagnostics happening in its body will be recorded as normal functions. If the complete object dtor or deleting dtor is called by other functions, the deferred diagnotics of the complete object dtor will be emitted.

  1. There is definition of complete object dtor.

Clang will not check the deleting dtor. In this case the complete object dtor will be checked as a normal function. As discussed in case 1, deleting dtor should result in the same deferred diagnotics as complete object dtor, therefore there is no need to differentiate call of deleting dtor and complete object dtor.

I thought you were saying that the destructor decl hadn't been created yet, but I see now that you're saying something more subtle.

CurContext is set to the destructor because the standard says in [class.dtor]p13:

At the point of definition of a virtual destructor (including an implicit definition), the non-array deallocation function is determined as if for the expression `delete this` appearing in a non-virtual destructor of the destructor’s class.

Which is to say that, semantically, the context is as if it were within the destructor, to the extent that this affects access control and so on.

I can see why this causes problems for your call graph (really a use graph), since it's a use in the apparent context of the destructor at a point where the destructor is not being defined. A similar thing happens with default arguments, but because we don't consider uses from default arguments to be true ODR-uses until the default argument is used, that probably doesn't cause problems for you.

I don't think the destructor -> deallocation function edge is actually interesting for your use graph. It'd be more appropriate to treat the deallocation function as used by the v-table than by the destructor; I don't know whether you make any attempt to model v-tables as nodes in your use graph. You might consider finding a simple way to suppress adding this edge, like just not adding edges from a destructor that's not currently being defined (D->willHaveBody()).

With all that said, maintaining a use graph for all the functions you might emit in the entire translation unit seems very expensive and brittle. Have you considered doing this walk in a final pass? You could just build up a set of all the functions you know you're going to emit and then walk their bodies looking for uses of lazy-emitted entities. If we don't already have a function that calls a callback for every declaration ODR-used by a function body, we should.

This doesn't look quite right to me. I don't think we should treat the delete this; for a destructor as being emitted-for-device in any translation unit in which the vtable is marked used. (For example, if in your testcase MSEmitDeletingDtor::CFileStream::CFileStream() were a __host__ function, I think you'd still diagnose, but presumably shouldn't do so, because the vtable -- and therefore CFileStream::operator delete -- is never referenced / emitted for the device.) Instead, I think we should treat the delete this; as being emitted in any translation unit in which the vtable itself is emitted-for-device. Presumably, this means you will need to model / track usage of the vtable itself in your "call graph".

I thought you were saying that the destructor decl hadn't been created yet, but I see now that you're saying something more subtle.

CurContext is set to the destructor because the standard says in [class.dtor]p13:

At the point of definition of a virtual destructor (including an implicit definition), the non-array deallocation function is determined as if for the expression `delete this` appearing in a non-virtual destructor of the destructor’s class.

Which is to say that, semantically, the context is as if it were within the destructor, to the extent that this affects access control and so on.

I can see why this causes problems for your call graph (really a use graph), since it's a use in the apparent context of the destructor at a point where the destructor is not being defined. A similar thing happens with default arguments, but because we don't consider uses from default arguments to be true ODR-uses until the default argument is used, that probably doesn't cause problems for you.

I don't think the destructor -> deallocation function edge is actually interesting for your use graph. It'd be more appropriate to treat the deallocation function as used by the v-table than by the destructor; I don't know whether you make any attempt to model v-tables as nodes in your use graph. You might consider finding a simple way to suppress adding this edge, like just not adding edges from a destructor that's not currently being defined (D->willHaveBody()).

With all that said, maintaining a use graph for all the functions you might emit in the entire translation unit seems very expensive and brittle. Have you considered doing this walk in a final pass? You could just build up a set of all the functions you know you're going to emit and then walk their bodies looking for uses of lazy-emitted entities. If we don't already have a function that calls a callback for every declaration ODR-used by a function body, we should.

The deferred diagnostic mechanism is shared between CUDA/HIP and OpenMP. The diagnostic messages not only depend on the callee, but also depend on the caller, the caller information needs to be kept. Also if a caller is to be emitted, all the deferred diagnostics associated with the direct or indirect callees need to be emitted. Therefore a call graph is needed for this mechanism.

If we ignore the dtor->deallocation edge in the call graph, we may miss diagnostics, e.g.

static __device__ __host__ void f(__m256i *p) {
  __asm__ volatile("vmovaps  %0, %%ymm0" ::"m"(*(__m256i *)p)
                 : "r0"); // MS-error{{unknown register name 'r0' in asm}}
}
struct CFileStream {
  void operator delete(void *p) {
    f(0);  // MS-note{{called by 'operator delete'}}
  }
  CFileStream();
  virtual ~CFileStream();  // MS-note{{called by '~CFileStream'}}
};

struct CMultiFileStream {
  CFileStream m_fileStream;
  ~CMultiFileStream();
};

// This causes vtable emitted so that deleting dtor is emitted for MS.
CFileStream::CFileStream() {}

Assuming the host compilation is on windows.

Here f() is a host device function which is unknown to be emitted, therefore the inline assembly error results in a delayed diagnostic. When f() is checked in the delete operator body, a 'delete operator -> f' edge is added to the call graph since f() is unknown to be emitted.

Since CFileStream::CFileStream is defined, clang sets vtbl to be emitted and does an explicit dtor check even though dtor is not defined. clang knows that this dtor check is for deleting dtor and will check delete operator as referenced, which causes `dtor -> delete operator' to be added to the call graph. Then clang checks dtor as referenced. Since deleting dtor will be emitted together with vtbl, clang should assume dtor is to be emitted. Then clang will found the callees 'delete operator' and f(), and emits the delayed diagnostics associated with them.

If we do not add 'dtor -> delete operator' edge to the call graph, the diagnostic msg in f() will not be emitted.

yaxunl updated this revision to Diff 237122.Jan 9 2020, 10:23 AM

Add tests for device compilation.

Add a test when both vtbl and deleting dtor are emitted with diagnostic due to delete operator.

I thought you were saying that the destructor decl hadn't been created yet, but I see now that you're saying something more subtle.

CurContext is set to the destructor because the standard says in [class.dtor]p13:

At the point of definition of a virtual destructor (including an implicit definition), the non-array deallocation function is determined as if for the expression `delete this` appearing in a non-virtual destructor of the destructor’s class.

Which is to say that, semantically, the context is as if it were within the destructor, to the extent that this affects access control and so on.

I can see why this causes problems for your call graph (really a use graph), since it's a use in the apparent context of the destructor at a point where the destructor is not being defined. A similar thing happens with default arguments, but because we don't consider uses from default arguments to be true ODR-uses until the default argument is used, that probably doesn't cause problems for you.

I don't think the destructor -> deallocation function edge is actually interesting for your use graph. It'd be more appropriate to treat the deallocation function as used by the v-table than by the destructor; I don't know whether you make any attempt to model v-tables as nodes in your use graph. You might consider finding a simple way to suppress adding this edge, like just not adding edges from a destructor that's not currently being defined (D->willHaveBody()).

With all that said, maintaining a use graph for all the functions you might emit in the entire translation unit seems very expensive and brittle. Have you considered doing this walk in a final pass? You could just build up a set of all the functions you know you're going to emit and then walk their bodies looking for uses of lazy-emitted entities. If we don't already have a function that calls a callback for every declaration ODR-used by a function body, we should.

The deferred diagnostic mechanism is shared between CUDA/HIP and OpenMP. The diagnostic messages not only depend on the callee, but also depend on the caller, the caller information needs to be kept. Also if a caller is to be emitted, all the deferred diagnostics associated with the direct or indirect callees need to be emitted. Therefore a call graph is needed for this mechanism.

If we ignore the dtor->deallocation edge in the call graph, we may miss diagnostics, e.g.

static __device__ __host__ void f(__m256i *p) {
  __asm__ volatile("vmovaps  %0, %%ymm0" ::"m"(*(__m256i *)p)
                 : "r0"); // MS-error{{unknown register name 'r0' in asm}}
}
struct CFileStream {
  void operator delete(void *p) {
    f(0);  // MS-note{{called by 'operator delete'}}
  }
  CFileStream();
  virtual ~CFileStream();  // MS-note{{called by '~CFileStream'}}
};

struct CMultiFileStream {
  CFileStream m_fileStream;
  ~CMultiFileStream();
};

// This causes vtable emitted so that deleting dtor is emitted for MS.
CFileStream::CFileStream() {}

Assuming the host compilation is on windows.

Here f() is a host device function which is unknown to be emitted, therefore the inline assembly error results in a delayed diagnostic. When f() is checked in the delete operator body, a 'delete operator -> f' edge is added to the call graph since f() is unknown to be emitted.

Since CFileStream::CFileStream is defined, clang sets vtbl to be emitted and does an explicit dtor check even though dtor is not defined. clang knows that this dtor check is for deleting dtor and will check delete operator as referenced, which causes `dtor -> delete operator' to be added to the call graph. Then clang checks dtor as referenced. Since deleting dtor will be emitted together with vtbl, clang should assume dtor is to be emitted. Then clang will found the callees 'delete operator' and f(), and emits the delayed diagnostics associated with them.

If we do not add 'dtor -> delete operator' edge to the call graph, the diagnostic msg in f() will not be emitted.

Most uses of the destructor do not use the delete operator, though, and therefore should not trigger the diagnostics in f to be emitted. And this really doesn't require a fully-realized use graph; you could very easily track the current use stack when making a later pass over the entities used.

Also I agree with Richard that you really need the v-table to be a node in your use graph/stack.

This doesn't look quite right to me. I don't think we should treat the delete this; for a destructor as being emitted-for-device in any translation unit in which the vtable is marked used. (For example, if in your testcase MSEmitDeletingDtor::CFileStream::CFileStream() were a __host__ function, I think you'd still diagnose, but presumably shouldn't do so, because the vtable -- and therefore CFileStream::operator delete -- is never referenced / emitted for the device.) Instead, I think we should treat the delete this; as being emitted in any translation unit in which the vtable itself is emitted-for-device. Presumably, this means you will need to model / track usage of the vtable itself in your "call graph".

A user declared ctor/dtor by default is __host__.

Let's consider this testcase:

static __device__ __host__ void f(__m256i *p) {
  __asm__ volatile("vmovaps  %0, %%ymm0" ::"m"(*(__m256i *)p)
                 : "r0"); // MS-error{{unknown register name 'r0' in asm}}
}
struct CFileStream {
  void operator delete(void *p) {
    f(0);  // MS-note{{called by 'operator delete'}}
  }
  CFileStream();
  virtual ~CFileStream();  // MS-note{{called by '~CFileStream'}}
};

struct CMultiFileStream {
  CFileStream m_fileStream;
  ~CMultiFileStream();
};

// This causes vtable emitted so that deleting dtor is emitted for MS.
CFileStream::CFileStream() {}

In host compilation, vtbl is emitted, since it causes dtor emitted, whereas dtor calls f(), therefore the diagnostic msg is emitted.

In device compilation, vtbl is not emitted, therefore dtor is not emitted, and the diagnostic msg in f() is not emitted.

We only need an entity in call graph if that entity can be called by other entities. Here vtbl is always at the top level of the 'call graph'. Therefore it is not needed to be in the call graph.

Most uses of the destructor do not use the delete operator, though, and therefore should not trigger the diagnostics in f to be emitted. And this really doesn't require a fully-realized use graph; you could very easily track the current use stack when making a later pass over the entities used.

The call graph is not for this specific situation. A call graph is needed because of the transitive nature of the deferred diagnostic message. That is, if any direct or indirect caller is emitted, the diagnostic msg needs to be emitted.

The deferred diagnostic msg is recorded when parsing a function body. At that time we do not know which function will directly or indirectly call it. How do we keep a use stack?

When we parsing other function bodies, we only know the direct callee. Since we do not know if this function indirectly calls the function with deferred diagnostics, we have to keep a record of all the caller/callee edges.

Most uses of the destructor do not use the delete operator, though, and therefore should not trigger the diagnostics in f to be emitted. And this really doesn't require a fully-realized use graph; you could very easily track the current use stack when making a later pass over the entities used.

The call graph is not for this specific situation. A call graph is needed because of the transitive nature of the deferred diagnostic message. That is, if any direct or indirect caller is emitted, the diagnostic msg needs to be emitted.

One of the points that Richard and I have been trying to make is that this really isn't specifically about *calls*, it's about *uses*. You only want to emit diagnostics associated with an entity if you actually have to emit that entity, and whether you emit an entity has nothing to do with what places might *call* it, but rather what places *use* it and therefore force it to be emitted. This is fortunate because call graphs are inherently imperfect because of indirect calls, but use graphs are totally reliable. It's also fortunate because it means you can piggy-back on all of the existing logic that Sema has for tracking ODR uses.

Richard and I are also pointing out that Sema has to treat the v-table as its own separate thing when tracking ODR uses, and so you need to as well. You want to emit diagnostics associated with a virtual function if you're emitting code that either (1) directly uses the function (e.g. by calling x->A::foo()) or (2) directly uses a v-table containing the function. You can't rely on Sema's normal ODR-use tracking for *either* of these, because Sema might have observed a use in code that you don't actually have to emit, e.g. host code if you're compiling for the device. That is, a v-table is only a "root" for virtual functions if you actually have to emit that v-table, and you can't know that without tracking v-tables in your use graph.

The deferred diagnostic msg is recorded when parsing a function body. At that time we do not know which function will directly or indirectly call it. How do we keep a use stack?

The "use stack" idea would apply if you switched from eagerly creating the entire use graph to instead just making a late pass that walked function bodies. If you walk function bodies depth-first, starting from a true root and gathering all the ODR-used entities to be recursively walked, then you can maintain a stack of what entities you're currently walking, and that stack is a use-path that explains why you need to emit the current function.

It should be straightforward to build a function that walks over the entities used by a function body and calls a callback by just extracting it out of the code in MarkDeclarationsUsedInExpr.

yaxunl updated this revision to Diff 240760.Jan 27 2020, 8:01 PM
yaxunl retitled this revision from [CUDA][HIP] Fix assertion due to dtor check on windows to [CUDA][HIP][OpenMP] Emit deferred diagnostics by a post-parsing AST travese.
yaxunl edited the summary of this revision. (Show Details)

Remove the call graph and do a final AST traverse by John's comments.

bader added a subscriber: bader.Jan 28 2020, 9:18 AM
yaxunl added a comment.EditedJan 30 2020, 8:39 AM

Most uses of the destructor do not use the delete operator, though, and therefore should not trigger the diagnostics in f to be emitted. And this really doesn't require a fully-realized use graph; you could very easily track the current use stack when making a later pass over the entities used.

The call graph is not for this specific situation. A call graph is needed because of the transitive nature of the deferred diagnostic message. That is, if any direct or indirect caller is emitted, the diagnostic msg needs to be emitted.

One of the points that Richard and I have been trying to make is that this really isn't specifically about *calls*, it's about *uses*. You only want to emit diagnostics associated with an entity if you actually have to emit that entity, and whether you emit an entity has nothing to do with what places might *call* it, but rather what places *use* it and therefore force it to be emitted. This is fortunate because call graphs are inherently imperfect because of indirect calls, but use graphs are totally reliable. It's also fortunate because it means you can piggy-back on all of the existing logic that Sema has for tracking ODR uses.

Richard and I are also pointing out that Sema has to treat the v-table as its own separate thing when tracking ODR uses, and so you need to as well. You want to emit diagnostics associated with a virtual function if you're emitting code that either (1) directly uses the function (e.g. by calling x->A::foo()) or (2) directly uses a v-table containing the function. You can't rely on Sema's normal ODR-use tracking for *either* of these, because Sema might have observed a use in code that you don't actually have to emit, e.g. host code if you're compiling for the device. That is, a v-table is only a "root" for virtual functions if you actually have to emit that v-table, and you can't know that without tracking v-tables in your use graph.

The deferred diagnostic msg is recorded when parsing a function body. At that time we do not know which function will directly or indirectly call it. How do we keep a use stack?

The "use stack" idea would apply if you switched from eagerly creating the entire use graph to instead just making a late pass that walked function bodies. If you walk function bodies depth-first, starting from a true root and gathering all the ODR-used entities to be recursively walked, then you can maintain a stack of what entities you're currently walking, and that stack is a use-path that explains why you need to emit the current function.

It should be straightforward to build a function that walks over the entities used by a function body and calls a callback by just extracting it out of the code in MarkDeclarationsUsedInExpr.

I updated the patch to remove the explicit call graph and use an AST traverse instead. Since this patch is big, is it OK to leave the tracking of vtable to some future patch? This patch is sufficient to fix the assertion seen on Windows. Thanks.

rjmccall added inline comments.Jan 30 2020, 10:17 AM
clang/lib/Sema/SemaExpr.cpp
17183

Is there any way to share most of the visitation logic here with the visitor we use in MarkDeclarationsUsedInExpr? Maybe make a UsedDeclVisitor CRTP class that calls a "asImpl().visitUsedDecl(SourceLocation Loc, Decl *D)" in the right places?

yaxunl updated this revision to Diff 242145.Feb 3 2020, 12:06 PM

revised by John's comments.

rjmccall added inline comments.Feb 3 2020, 3:10 PM
clang/lib/Sema/SemaExpr.cpp
17127

This should inherit from EvaluatedExprVisitor<Derived>, or else calls from EvaluatedExprVisitor and above won't dispatch all the way down to the subclass. This will allow subclasses to do node-specific logic, like your subclass's handling of InOMPDeviceContext or EvaluatedExprMarker's need to do custom things with local variables, DREs, and MEs.

Please also define this in a header; it doesn't need to be file-specific. I guess it needs a Sema & because of the call to LookupDestructor, so lib/Sema is probably the right place for that header.

17152

Let's not have both a visitDeclRefExpr and a VisitDeclRefExpr, distinguished only by capitalization.

17158

Please have all these call sites call asImpl().visitUsedDecl directly, and then don't define it in this class.

17195

This should be in your OMP-specific subclass.

yaxunl updated this revision to Diff 242251.Feb 3 2020, 8:29 PM
yaxunl marked 5 inline comments as done.

revised by John's comments.

rjmccall added inline comments.Feb 3 2020, 8:46 PM
clang/lib/Sema/SemaExpr.cpp
17254

Thanks, this looks a lot better.

Should this be moved to SemaOpenMP.cpp (and renamed to be OpenMP-specific), or do you think it's going to be useful in other modes?

clang/lib/Sema/UsedDeclVisitor.h
1 ↗(On Diff #242251)

Please fix this line.

rjmccall added inline comments.Feb 3 2020, 8:46 PM
clang/lib/Sema/UsedDeclVisitor.h
9 ↗(On Diff #242251)

"a CRTP class which visits all the declarations that are ODR-used by an expression or statement."

65 ↗(On Diff #242251)

It's generally best to asImpl() when restarting on a sub-expression like this, just in case the derived class wants to do something there. Same thing in VisitCXXBindTemporaryExpr.

yaxunl updated this revision to Diff 242296.Feb 4 2020, 4:59 AM
yaxunl marked 4 inline comments as done.

revised by John's comments.

clang/lib/Sema/SemaExpr.cpp
17254

It is not just for OpenMP. Deferred diagnostics are also emitted by CUDA/HIP.

One minor request, but otherwise LGTM; feel free to commit with that change.

clang/lib/Sema/SemaExpr.cpp
17254

Okay. Can it go in Sema.cpp next to the other overload of emitDeferredDiags, then? There isn't really much purpose to it being in this file.

yaxunl marked 2 inline comments as done.Feb 4 2020, 8:08 AM
yaxunl added inline comments.
clang/lib/Sema/SemaExpr.cpp
17254

will do when committing. thanks.

This revision was not accepted when it landed; it landed in state Needs Review.Feb 16 2020, 7:47 PM
This revision was automatically updated to reflect the committed changes.
yaxunl marked an inline comment as done.
Herald added a project: Restricted Project. · View Herald TranscriptFeb 16 2020, 7:47 PM
hliao added a subscriber: hliao.Feb 16 2020, 8:00 PM

one header is missing and breaks the build

clang/lib/Sema/Sema.cpp
14

this file is missing and breaks the build

MaskRay added inline comments.
clang/lib/Sema/Sema.cpp
14
Fznamznon added inline comments.Feb 17 2020, 8:17 AM
clang/lib/Sema/Sema.cpp
1441

This particular change causes duplication of deferred diagnostics.
Consider the following example (please correct me if I'm doing something wrong, I'm not an expert in OpenMP):

int foobar1() { throw 1; } // error is expected here

// let's try to use foobar1 in the code where exceptions aren't allowed
#pragma omp declare target    
int (*B)() = &foobar1;        
#pragma omp end declare target

// and in some other place let's use foobar1 in device code again
#pragma omp declare target    
int a = foobar1();            
#pragma omp end declare target

Then diagnostic for foobar1 will be duplicated for each use of foobar1 under target directive.
I first experienced this behavior not with OpenMP, so I suppose reproducer can be done for each programming model which uses deferred diagnostics.

yaxunl marked an inline comment as done.Feb 17 2020, 10:27 AM
yaxunl added inline comments.
clang/lib/Sema/Sema.cpp
1441

The change is intentional so that each call chain causing the diagnostic can be identified. The drawback is that it is more verbose.

I can change this behavior so that the diagnostic will be emitted only for the first call chain that causes the diagnostic, if less verbose diagnostics is preferred.

This seems to result in triggering clang/lib/CodeGen/CGExpr.cpp:2626 when compiling mlir/lib/Transforms/AffineDataCopyGeneration.cpp with clang build with assertions on (clean build at e8e078c just before this change, broken at this, assert triggering at build fix commit).

https://buildkite.com/mlir/mlir-core/builds/2792#a54fb239-718b-4f0b-a309-f83e46ceb252

This seems to result in triggering clang/lib/CodeGen/CGExpr.cpp:2626 when compiling mlir/lib/Transforms/AffineDataCopyGeneration.cpp with clang build with assertions on (clean build at e8e078c just before this change, broken at this, assert triggering at build fix commit).

https://buildkite.com/mlir/mlir-core/builds/2792#a54fb239-718b-4f0b-a309-f83e46ceb252

Seems reasonable to revert if there's a testcase that they can get from rebuilding llvm with mlir enabled.

erichkeane added inline comments.
clang/lib/Sema/Sema.cpp
1486

Note that when recommitting this (if you choose to), this needs to also handle NamespaceDecl. We're a downstream and discovered that this doesn't properly handle functions or records handled in a namespace.

It can be implemented identically to TranslationUnitDecl.

rjmccall added inline comments.Feb 19 2020, 10:31 AM
clang/lib/Sema/Sema.cpp
1486

Wait, what? We shouldn't be doing this for TranslationUnitDecl either. I don't even know how we're "using" a TranslationUnitDecl, but neither this case not the case for NamespaceDecl should be recursively using every declaration declared inside it. If there's a declaration in a namespace that's being used, it should be getting visited as part of the actual use of it.

The logic for RecordDecl has the same problem.

erichkeane added inline comments.Feb 19 2020, 10:44 AM
clang/lib/Sema/Sema.cpp
1486

Despite the name, this seems to be more of a home-written ast walking class. The entry point is the 'translation unit' which seems to walk through everything in an attempt to find all the functions (including those that are 'marked' as used by an attribute).

You'll see the FunctionDecl section makes this assumption as well (not necessarily that we got to a function via a call). IMO, this approach is strange, and we should register entry points in some manner (functions marked as emitted to the device in some fashion), then just follow its call-graph (via the clang::CallGraph?) to emit all of these functions.

It seemed really odd to see this approach here, but it seemed well reviewed by the time I noticed it (via a downstream bug) so I figured I'd lost my chance to disagree with the approach.

rjmccall added inline comments.Feb 19 2020, 10:56 AM
clang/lib/Sema/Sema.cpp
1486

Sure, but visitUsedDecl isn't the right place to be entering the walk. visitUsedDecl is supposed to be the *callback* from the walk. If they need to walk all the global declarations to find kernels instead of tracking the kernels as they're encountered (which would be a *much* better approach), it should be done as a separate function.

I just missed this in the review.

Seems to me, it causes some other issues. See https://bugs.llvm.org/show_bug.cgi?id=44948 for example

Seems to me, it causes some other issues. See https://bugs.llvm.org/show_bug.cgi?id=44948 for example

I will fix that bug.

yaxunl marked an inline comment as done.Feb 19 2020, 2:05 PM
yaxunl added inline comments.
clang/lib/Sema/Sema.cpp
1486

The deferred diagnostics could be initiated by non-kernel functions or even host functions.

Let's consider a device code library where no kernels are defined. A device function is emitted, which calls a host device function which has a deferred diagnostic. All device functions that are emitted need to be checked.

Same with host functions that are emitted, which may call a host device function which has deferred diagnostic.

Also not just function calls need to be checked. A function address may be taken then called through function pointer. Therefore any reference to a function needs to be followed.

In the case of OpenMP, the initialization of a global function pointer which refers a function may trigger a deferred diangostic. There are tests for that.

rjmccall added inline comments.Feb 19 2020, 3:32 PM
clang/lib/Sema/Sema.cpp
1486

Right, I get that emitting deferred diagnostics for a declaration D needs to trigger any deferred diagnostics in declarations used by D, recursively. You essentially have a graph of lazily-emitted declarations (which may or may not have deferred diagnostics) and a number of eagerly-emitted "root" declarations with use-edges leading into that graph. Any declaration that's reachable from a root will need to be emitted and so needs to have any deferred diagnostics emitted as well. My question is why you're finding these roots with a retroactive walk of the entire translation unit instead of either building a list of roots as you go or (better yet) building a list of lazily-emitted declarations that are used by those roots. You can unambiguously identify at the point of declaration whether an entity will be eagerly or lazily emitted, right? If you just store those initial edges into the lazily-emitted declarations graph and then initiate the recursive walk from them at the end of the translation unit, you'll only end up walking declarations that are actually relevant to your compilation, so you'll have much better locality and (if this matters to you) you'll naturally work a lot better with PCH and modules.

yaxunl marked an inline comment as done.Feb 20 2020, 8:57 AM
yaxunl added inline comments.
clang/lib/Sema/Sema.cpp
1486

I will try the approach you suggested. Basically I will record the emitted functions and variables during parsing and use them as starting point for the final traversal.

This should work for CUDA/HIP. However it may be tricky for OpenMP since the emission of some entities depending on pragmas. Still it may be doable. If I encounter difficulty I will come back for discussion.

I will post the change for review.

Thanks.

bader added inline comments.Feb 20 2020, 9:06 AM
clang/lib/Sema/Sema.cpp
1486

FYI: SYCL is also using deferred diagnostics engine to emit device side diagnostics, although this part hasn't been up-streamed yet, but we are tracking changes in this area.
SYCL support implementation should be quite similar to CUDA/HIP.

I tried recording functions to be emitted during normal parsing and using it as starting point for the final traversal. It is quite promising. I only get one lit test failure for OpenMP:

int foobar2();

#pragma omp declare target
int (*B)() = &foobar2;
#pragma omp end declare target

int foobar2() { throw 1; } // expected-error {{cannot use 'throw' with exceptions disabled}}

In this case, the emission state of foobar2 cannot be determined by itself. It can only be determined to be emitted through variable B. Therefore, I also need to record variables that are potentially emitted.

I tried recording functions to be emitted during normal parsing and using it as starting point for the final traversal. It is quite promising. I only get one lit test failure for OpenMP:

int foobar2();

#pragma omp declare target
int (*B)() = &foobar2;
#pragma omp end declare target

int foobar2() { throw 1; } // expected-error {{cannot use 'throw' with exceptions disabled}}

In this case, the emission state of foobar2 cannot be determined by itself. It can only be determined to be emitted through variable B. Therefore, I also need to record variables that are potentially emitted.

Okay. Sounds like you have some common cause with https://reviews.llvm.org/D71227, then. Pinging @hliao.

Also, we cannot remove traversing of RecordDecl and CapturedDecl encountered in function body since we have OpenMP test like this:

int main() {
#pragma omp target
  {
    t1(0);
  }
  return 0;
}

This results in a kernel function embedded in a captured record decl in AST. We have to drill into the record decl to get the kernel and the function called by it.

I still got assertion when I use the built clang with check-mlir. The reduced testcase is

class A {
public:
  int foo();
};

static A a;

struct B {
  B(int x = a.foo());
};

void test() {
  B x;
}

The assertion I got is:

clang: /home/yaxunl/git/llvm/llvm/tools/clang/lib/CodeGen/CGExpr.cpp:2628: clang::CodeGen::LValue clang::CodeGen::CodeGenFunction::EmitDeclRefLValue(const clang::DeclRefExpr *): Assertion `(ND->isUsed(false) || !isa<VarDecl>(ND) || E->isNonOdrUse() || !E->getLocation().isValid()) && "Should not use decl without marking it used!"' failed.
Stack dump:


 #0 0x000000000258c614 PrintStackTraceSignalHandler(void*) (/home/yaxunl/git/llvm/assert/bin/clang+0x258c614)
 #1 0x000000000258a1ae llvm::sys::RunSignalHandlers() (/home/yaxunl/git/llvm/assert/bin/clang+0x258a1ae)
 #2 0x000000000258b7a2 llvm::sys::CleanupOnSignal(unsigned long) (/home/yaxunl/git/llvm/assert/bin/clang+0x258b7a2)
 #3 0x000000000251d0c3 (anonymous namespace)::CrashRecoveryContextImpl::HandleCrash(int, unsigned long) (/home/yaxunl/git/llvm/assert/bin/clang+0x251d0c3)
 #4 0x000000000251d1fc CrashRecoverySignalHandler(int) (/home/yaxunl/git/llvm/assert/bin/clang+0x251d1fc)
 #5 0x00007f0dde3bf390 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x11390)
 #6 0x00007f0ddcf29428 raise /build/glibc-LK5gWL/glibc-2.23/signal/../sysdeps/unix/sysv/linux/raise.c:54:0
 #7 0x00007f0ddcf2b02a abort /build/glibc-LK5gWL/glibc-2.23/stdlib/abort.c:91:0
 #8 0x00007f0ddcf21bd7 __assert_fail_base /build/glibc-LK5gWL/glibc-2.23/assert/assert.c:92:0
 #9 0x00007f0ddcf21c82 (/lib/x86_64-linux-gnu/libc.so.6+0x2dc82)
#10 0x0000000002a1a5df clang::CodeGen::CodeGenFunction::EmitDeclRefLValue(clang::DeclRefExpr const*) (/home/yaxunl/git/llvm/assert/bin/clang+0x2a1a5df)
#11 0x0000000002a0dfb6 clang::CodeGen::CodeGenFunction::EmitLValue(clang::Expr const*) (/home/yaxunl/git/llvm/assert/bin/clang+0x2a0dfb6)
#12 0x0000000002a39973 clang::CodeGen::CodeGenFunction::EmitCXXMemberOrOperatorMemberCallExpr(clang::CallExpr const*, clang::CXXMethodDecl const*, clang::CodeGen::ReturnValueSlot, bool, clang::NestedNameSpecifier*, bool, clang::Expr const*) (/home/yaxunl/git/llvm/assert/bin/clang+0x2a39973)
#13 0x0000000002a389b9 clang::CodeGen::CodeGenFunction::EmitCXXMemberCallExpr(clang::CXXMemberCallExpr const*, clang::CodeGen::ReturnValueSlot) (/home/yaxunl/git/llvm/assert/bin/clang+0x2a389b9)
#14 0x0000000002a28f95 clang::CodeGen::CodeGenFunction::EmitCallExpr(clang::CallExpr const*, clang::CodeGen::ReturnValueSlot) (/home/yaxunl/git/llvm/assert/bin/clang+0x2a28f95)
#15 0x0000000002a5be29 (anonymous namespace)::ScalarExprEmitter::VisitCallExpr(clang::CallExpr const*) (/home/yaxunl/git/llvm/assert/bin/clang+0x2a5be29)
#16 0x0000000002a55b19 clang::StmtVisitorBase<std::add_pointer, (anonymous namespace)::ScalarExprEmitter, llvm::Value*>::Visit(clang::Stmt*) (/home/yaxunl/git/llvm/assert/bin/clang+0x2a55b19)
#17 0x0000000002a4b615 clang::CodeGen::CodeGenFunction::EmitScalarExpr(clang::Expr const*, bool) (/home/yaxunl/git/llvm/assert/bin/clang+0x2a4b615)
#18 0x0000000002a0da30 clang::CodeGen::CodeGenFunction::EmitAnyExpr(clang::Expr const*, clang::CodeGen::AggValueSlot, bool) (/home/yaxunl/git/llvm/assert/bin/clang+0x2a0da30)
#19 0x0000000002a0edde clang::CodeGen::CodeGenFunction::EmitAnyExprToTemp(clang::Expr const*) (/home/yaxunl/git/llvm/assert/bin/clang+0x2a0edde)
#20 0x00000000029cdd6b clang::CodeGen::CodeGenFunction::EmitCallArg(clang::CodeGen::CallArgList&, clang::Expr const*, clang::QualType) (/home/yaxunl/git/llvm/assert/bin/clang+0x29cdd6b)
#21 0x00000000029ccc41 clang::CodeGen::CodeGenFunction::EmitCallArgs(clang::CodeGen::CallArgList&, llvm::ArrayRef<clang::QualType>, llvm::iterator_range<clang::Stmt::CastIterator<clang::Expr, clang::Expr const* const, clang::Stmt const* const> >, clang::CodeGen::CodeGenFunction::AbstractCallee, unsigned int, clang::CodeGen::CodeGenFunction::EvaluationOrder) (/home/yaxunl/git/llvm/assert/bin/clang+0x29ccc41)
#22 0x00000000028d8e7b void clang::CodeGen::CodeGenFunction::EmitCallArgs<clang::FunctionProtoType>(clang::CodeGen::CallArgList&, clang::FunctionProtoType const*, llvm::iterator_range<clang::Stmt::CastIterator<clang::Expr, clang::Expr const* const, clang::Stmt const* const> >, clang::CodeGen::CodeGenFunction::AbstractCallee, unsigned int, clang::CodeGen::CodeGenFunction::EvaluationOrder) (/home/yaxunl/git/llvm/assert/bin/clang+0x28d8e7b)
#23 0x00000000029de431 clang::CodeGen::CodeGenFunction::EmitCXXConstructorCall(clang::CXXConstructorDecl const*, clang::CXXCtorType, bool, bool, clang::CodeGen::AggValueSlot, clang::CXXConstructExpr const*) (/home/yaxunl/git/llvm/assert/bin/clang+0x29de431)
#24 0x0000000002a3b84e clang::CodeGen::CodeGenFunction::EmitCXXConstructExpr(clang::CXXConstructExpr const*, clang::CodeGen::AggValueSlot) (/home/yaxunl/git/llvm/assert/bin/clang+0x2a3b84e)
#25 0x0000000002a32a8b (anonymous namespace)::AggExprEmitter::VisitCXXConstructExpr(clang::CXXConstructExpr const*) (/home/yaxunl/git/llvm/assert/bin/clang+0x2a32a8b)
#26 0x0000000002a2d44f clang::CodeGen::CodeGenFunction::EmitAggExpr(clang::Expr const*, clang::CodeGen::AggValueSlot) (/home/yaxunl/git/llvm/assert/bin/clang+0x2a2d44f)
#27 0x00000000029f96fc clang::CodeGen::CodeGenFunction::EmitExprAsInit(clang::Expr const*, clang::ValueDecl const*, clang::CodeGen::LValue, bool) (/home/yaxunl/git/llvm/assert/bin/clang+0x29f96fc)
#28 0x00000000029f68d9 clang::CodeGen::CodeGenFunction::EmitAutoVarInit(clang::CodeGen::CodeGenFunction::AutoVarEmission const&) (/home/yaxunl/git/llvm/assert/bin/clang+0x29f68d9)
#29 0x00000000029f1ca5 clang::CodeGen::CodeGenFunction::EmitVarDecl(clang::VarDecl const&) (/home/yaxunl/git/llvm/assert/bin/clang+0x29f1ca5)
#30 0x00000000029f1935 clang::CodeGen::CodeGenFunction::EmitDecl(clang::Decl const&) (/home/yaxunl/git/llvm/assert/bin/clang+0x29f1935)
#31 0x00000000027e07fb clang::CodeGen::CodeGenFunction::EmitDeclStmt(clang::DeclStmt const&) (/home/yaxunl/git/llvm/assert/bin/clang+0x27e07fb)
#32 0x00000000027d7a4c clang::CodeGen::CodeGenFunction::EmitSimpleStmt(clang::Stmt const*) (/home/yaxunl/git/llvm/assert/bin/clang+0x27d7a4c)
#33 0x00000000027d66cb clang::CodeGen::CodeGenFunction::EmitStmt(clang::Stmt const*, llvm::ArrayRef<clang::Attr const*>) (/home/yaxunl/git/llvm/assert/bin/clang+0x27d66cb)
#34 0x00000000027e15f0 clang::CodeGen::CodeGenFunction::EmitCompoundStmtWithoutScope(clang::CompoundStmt const&, bool, clang::CodeGen::AggValueSlot) (/home/yaxunl/git/llvm/assert/bin/clang+0x27e15f0)
#35 0x000000000282ffb6 clang::CodeGen::CodeGenFunction::GenerateCode(clang::GlobalDecl, llvm::Function*, clang::CodeGen::CGFunctionInfo const&) (/home/yaxunl/git/llvm/assert/bin/clang+0x282ffb6)
#36 0x000000000284dc52 clang::CodeGen::CodeGenModule::EmitGlobalFunctionDefinition(clang::GlobalDecl, llvm::GlobalValue*) (/home/yaxunl/git/llvm/assert/bin/clang+0x284dc52)
#37 0x0000000002845cc7 clang::CodeGen::CodeGenModule::EmitGlobalDefinition(clang::GlobalDecl, llvm::GlobalValue*) (/home/yaxunl/git/llvm/assert/bin/clang+0x2845cc7)
#38 0x0000000002852271 clang::CodeGen::CodeGenModule::EmitTopLevelDecl(clang::Decl*) (/home/yaxunl/git/llvm/assert/bin/clang+0x2852271)

It is weird since this is not en OpenMP nor CUDA program and there is no deferred diags involved.

It seems somehow my change caused some decl missing used flag.

yaxunl updated this revision to Diff 247275.Feb 28 2020, 8:24 AM

Do not traverse the whole CU. Record potentially emitted functions and variables in the normal parsing and traverse them instead.

Also fixed bug 44948 and regression in check-mlir.

rnk removed a subscriber: rnk.Mar 6 2020, 2:14 PM
rjmccall added inline comments.Mar 7 2020, 11:57 AM
clang/include/clang/Sema/Sema.h
1432

This needs to be saved and restored in modules / PCH.

clang/lib/Sema/Sema.cpp
1444

Hmm. I know this is existing code, but I just realized something. I think it's okay to not emit the notes on every diagnostic, but you might want to emit them on the first diagnostic from a function instead of after the last. If the real bug is that the program is using something it's not supposed to use, and there are enough errors in that function to reach the error limit, then the diagnostics emitter will associate these notes with a diagnostic it's suppressing and so presumably suppress them as well, leaving the user with no way to find this information.

1466

This needs to trigger if you use a variable with delayed diagnostics, too, right?

When you add these methods to UsedDeclVisitor, you'll be able to remove them here.

1484

Should this also go in the base UsedDeclVisitor? I'm less sure about that because the captured statement is really always a part of the enclosing function, right? Should the delay mechanism just be looking through local sub-contexts instead?

clang/lib/Sema/UsedDeclVisitor.h
21 ↗(On Diff #247275)

Could you add this in a separate patch?

30 ↗(On Diff #247275)

There should definitely be cases in here for every expression that uses a declaration, including both DeclRefExpr and MemberExpr. Those might be overridden in subclasses, but that's their business; the default behavior should be to visit every used decl.

yaxunl marked 23 inline comments as done.Mar 16 2020, 6:01 PM
yaxunl added inline comments.
clang/include/clang/Sema/Sema.h
1432

done

clang/lib/Sema/Sema.cpp
1441

the change is intentional to report all use chains which result in deferred diagnostics, otherwise user may fix one issue then see another issue, instead of see all of the issues in one compilation.

1444

done

1466

fixed

1484

yes this one should also go to UsedDeclVisitor since this statement causes a RecordDecl generated which includes a FunctionDecl for a kernel, therefore this RecordDecl needs to be visited as used decl. I am not sure if other sub-context have the same effect. If so, I think they need to be handled case by case.

clang/lib/Sema/UsedDeclVisitor.h
21 ↗(On Diff #247275)
30 ↗(On Diff #247275)

done

yaxunl updated this revision to Diff 250666.Mar 16 2020, 6:06 PM
yaxunl marked 6 inline comments as done.

revised by John's comments.

rjmccall added inline comments.Mar 16 2020, 7:30 PM
clang/lib/Sema/Sema.cpp
1486

Okay, thank you. Do you still need all the cases in here for records, templates, and so on? It looks to me like you should always end up here with exactly the variables and functions that are being used, and you should never need to make special efforts to e.g. visit all the specializations of a template or visit all the methods of a class.

1505

Can there also be deferred diagnostics associated with this initializer?

clang/lib/Sema/SemaDecl.cpp
12229

DeclsToCheckForDeferredDiags is basically a set of declarations that you know to have to emit, right? It doesn't seem right to be adding every variable with an initializer to that set — especially because I'm pretty sure this function gets called for literally every variable with an initializer, including local variables. Presumably you only need to do this for global variables that you're definitely going to emit in the current mode.

yaxunl updated this revision to Diff 250960.Mar 17 2020, 6:49 PM
yaxunl marked 6 inline comments as done.

revised by John's comments.

clang/lib/Sema/Sema.cpp
1486

I can remove handling of templates and records. However I have to keep the handling of CapturedDecl. It is generated from code like

void t1(int r) {}

int main() {
#pragma omp target
  {
    t1(0);
  }
  return 0;
}

And it is like a function decl embeded in function main, e.g.

-FunctionDecl 0x86f7c70 <line:8:1, line:15:1> line:8:5 main 'int ()'
  `-CompoundStmt 0x873c3f8 <col:12, line:15:1>
    |-OMPTargetDirective 0x873c3a0 <line:9:1, col:19>
    | `-CapturedStmt 0x873c378 <line:10:3, line:13:3>
    |   `-CapturedDecl 0x873bd18 <<invalid sloc>> <invalid sloc> nothrow
    |     |-CapturedStmt 0x873c350 <line:10:3, line:13:3>
    |     | `-CapturedDecl 0x873c198 <<invalid sloc>> <invalid sloc> nothrow
    |     |   |-CompoundStmt 0x873c338 <line:10:3, line:13:3>
    |     |   | `-CallExpr 0x873c310 <line:12:5, col:9> 'void'
    |     |   |   |-ImplicitCastExpr 0x873c2f8 <col:5> 'void (*)(int)' <FunctionToPointerDecay>
    |     |   |   | `-DeclRefExpr 0x873c290 <col:5> 'void (int)' Function 0x86f7b18 't1' 'void (int)'
    |     |   |   `-IntegerLiteral 0x873c2b0 <col:8> 'int' 0
    |     |   `-ImplicitParamDecl 0x873c228 <line:9:1> col:1 implicit __context 'struct (anonymous at nvptx_va_arg_delayed_diags2.c:9:1) *const restrict'
    |     |-AlwaysInlineAttr 0x873c040 <<invalid sloc>> Implicit __forceinline
    |     |-ImplicitParamDecl 0x873bda0 <col:1> col:1 implicit .global_tid. 'const int'
    |     |-ImplicitParamDecl 0x873be08 <col:1> col:1 implicit .part_id. 'const int *const restrict'
    |     |-ImplicitParamDecl 0x873be70 <col:1> col:1 implicit .privates. 'void *const restrict'
    |     |-ImplicitParamDecl 0x873bed8 <col:1> col:1 implicit .copy_fn. 'void (*const restrict)(void *const restrict, ...)'
    |     |-ImplicitParamDecl 0x873bf40 <col:1> col:1 implicit .task_t. 'void *const'
    |     |-ImplicitParamDecl 0x873bfd8 <col:1> col:1 implicit __context 'struct (anonymous at nvptx_va_arg_delayed_diags2.c:9:1) *const restrict'
    |     |-RecordDecl 0x873c098 <col:1> col:1 implicit struct definition
    |     | `-CapturedRecordAttr 0x873c140 <<invalid sloc>> Implicit
    |     `-CapturedDecl 0x873c198 <<invalid sloc>> <invalid sloc> nothrow
    |       |-CompoundStmt 0x873c338 <line:10:3, line:13:3>
    |       | `-CallExpr 0x873c310 <line:12:5, col:9> 'void'
    |       |   |-ImplicitCastExpr 0x873c2f8 <col:5> 'void (*)(int)' <FunctionToPointerDecay>
    |       |   | `-DeclRefExpr 0x873c290 <col:5> 'void (int)' Function 0x86f7b18 't1' 'void (int)'
    |       |   `-IntegerLiteral 0x873c2b0 <col:8> 'int' 0
    |       `-ImplicitParamDecl 0x873c228 <line:9:1> col:1 implicit __context 'struct (anonymous at nvptx_va_arg_delayed_diags2.c:9:1) *const restrict'
    `-ReturnStmt 0x873c3e8 <line:14:3, col:10>
      `-IntegerLiteral 0x873c3c8 <col:10> 'int' 0

If I do not handle it, I will not be able to reach the call of t1().

1505

Yes. A global variable may be marked by omp declare target directive to be emitted on device. If the global var is initialized with the address of a function, the function will be emitted on device. If the device function calls a host device function which contains a deferred diag, that diag will be emitted. This can only be known after everything is parsed.

clang/lib/Sema/SemaDecl.cpp
12229

Yes we only need to check global variables. Fixed.

rjmccall added inline comments.Mar 18 2020, 10:37 AM
clang/lib/Sema/Sema.cpp
1486

Sure, although I wonder if it might be more reasonable to just make UsedDeclVisitor walk into CapturedDecls (and BlockDecls) when it sees the corresponding statements/expressions. Unlike other declaration references, those are never "cross-references"; they're just local code tied to a declaration for representational reasons.

1505

I meant directly with the initializer. Is there a way today to defer a diagnostic that you would emit while processing an initializer expression? If so, this needs to trigger that.

yaxunl marked 4 inline comments as done.Mar 18 2020, 1:43 PM
yaxunl added inline comments.
clang/lib/Sema/Sema.cpp
1486

done

1505

I don't think the initializer itself (without a target declare directive) will cause a deferred diagnostic since it does not cause change of emission states of functions.

yaxunl updated this revision to Diff 251167.Mar 18 2020, 1:46 PM
yaxunl marked 2 inline comments as done.

revised by John's comments

This looks good, assuming there's either no issue with the lazy emission of variables or that you just intend to tackle that later.

clang/lib/Sema/Sema.cpp
1505

Okay, so if I'm getting this right: only functions are emitted lazily, and variables have to be marked specially in order to get emitted on the device, so there's no need to defer diagnostics within variable initializations because we always know at the time of processing the variable where it will be emitted?

yaxunl marked 2 inline comments as done.Mar 18 2020, 3:18 PM
yaxunl added inline comments.
clang/lib/Sema/Sema.cpp
1505

right.

sugak added subscribers: weiwang, sugak.EditedAug 24 2021, 6:34 PM

Hi @yaxunl! I'm working on upgrading a large codebase from LLVM-9 to LLVM-12. I noticed on average 10% compilation speed regression that seems to be caused this change. We use Clang modules and historically provide -fopenmp compiler flag by default. The problem seems to be that compiling and importing modules is now slower, with the generated modules size increased by 2X. llvm-bcanalyzer tool shows that it's dominated by DECLS_TO_CHECK_FOR_DEFERRED_DIAGS. If I understand it right, your change is only relevant when target offloading is used. I inspected all of #pragma omp directives and can confirm that we don't use it.

I see that most of this code is gated by OpenMP flag. I wonder if there is a finer grain way to enable openmp parallel code generation without target offloading? Would it make sense to extend this code to check if -fopenom-targets is set before recording DECLS_TO_CHECK_FOR_DEFERRED_DIAGS?

Note, this was measured with @weiwang's https://reviews.llvm.org/D101793.

weiwang added a comment.EditedAug 26 2021, 10:01 AM

Hi @yaxunl! I'm working on upgrading a large codebase from LLVM-9 to LLVM-12. I noticed on average 10% compilation speed regression that seems to be caused this change. We use Clang modules and historically provide -fopenmp compiler flag by default. The problem seems to be that compiling and importing modules is now slower, with the generated modules size increased by 2X. llvm-bcanalyzer tool shows that it's dominated by DECLS_TO_CHECK_FOR_DEFERRED_DIAGS. If I understand it right, your change is only relevant when target offloading is used. I inspected all of #pragma omp directives and can confirm that we don't use it.

I see that most of this code is gated by OpenMP flag. I wonder if there is a finer grain way to enable openmp parallel code generation without target offloading? Would it make sense to extend this code to check if -fopenom-targets is set before recording DECLS_TO_CHECK_FOR_DEFERRED_DIAGS?

Note, this was measured with @weiwang's https://reviews.llvm.org/D101793.

@yaxunl We did an internal measurement by not adding decls into deferred diags, and that resolves the build regression. Wonder if we can have a special case for emitting diag as they are encountered when everything is on host side.

wenlei added a subscriber: wenlei.Aug 30 2021, 11:49 PM
hokein added a subscriber: hokein.EditedOct 21 2021, 5:43 AM

This patch seems to cause a new crash, details are at https://bugs.llvm.org/show_bug.cgi?id=52250.

yaxunl marked an inline comment as done.Oct 21 2021, 7:43 AM

This patch seems to cause a new crash, details are at https://bugs.llvm.org/show_bug.cgi?id=52250.

I will take a look. Thanks.