This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP] Offloading descriptor registration and device codegen.
ClosedPublic

Authored by sfantao on Sep 3 2015, 4:16 PM.

Details

Summary

In order to offloading work properly two things need to be in place:

  • a descriptor with all the offloading information (device entry functions, and global variable) has to be created by the host and registered in the OpenMP offloading runtime library.
  • all the device functions need to be emitted for the device and a convention has to be in place so that the runtime library can easily map the host ID of an entry point with the actual function in the device.

This patch adds support for these two things. However, only entry functions are being registered given that 'declare target' directive is not yet implemented.

About offloading descriptor:

The details of the descriptor are explained with more detail in http://goo.gl/L1rnKJ. Basically the descriptor will have fields that specify the number of devices, the pointers to where the device images begin and end (that will be defined by the linker), and also pointers to a the begin and end of table whose entries contain information about a specific entry point. Each entry has the type:

struct __tgt_offload_entry{
 void *addr;
 char *name;
 int64_t size;
};

and will be implemented in a pre determined (ELF) section .omp_offloading.entries with 1-byte alignment, so that when all the objects are linked, the table is in that section with no padding in between entries (will be like a C array). The code generation ensures that all __tgt_offload_entry entries are emitted in the same order for both host and device so that the runtime can have the corresponding entries in both host and device in same index of the table, and efficiently implement the mapping.

The resulting descriptor is registered/unregistered with the runtime library using the calls __tgt_register_lib and __tgt_unregister_lib. The registration is implemented in a high priority global initializer so that the registration happens always before any initializer (that can potentially include target regions) is run.

The driver flag -omptargets= was created to specify a comma separated list of devices the user wants to support so that the new functionality can be exercised. Each device is specified with its triple.

About target codegen:

The target codegen is pretty much straightforward as it reuses completely the logic of the host version for the same target region. The tricky part is to identify the meaningful target regions in the device side. Unlike other programming models, like CUDA, there are no already outlined functions with attributes that mark what should be emitted or not. So, the information on what to emit is passed in the form of metadata in host bc file. This requires a new option to pass the host bc to the device frontend. Then everything is similar to what happens in CUDA: the global declarations emission is intercepted to check to see if it is an "interesting" declaration. The difference is that instead of checking an attribute, the metadata information in checked. Right now, there is only a form of metadata to pass information about the device entry points (target regions). A class OffloadEntriesInfoManagerTy was created to manage all the information and queries related with the metadata. The metadata looks like this:

!omp_offload.info = !{!0, !1, !2, !3, !4, !5, !6}

!0 = !{i32 0, i32 52, i32 77426347, !"_ZN2S12r1Ei", i32 479, i32 13, i32 4}
!1 = !{i32 0, i32 52, i32 77426347, !"_ZL7fstatici", i32 461, i32 11, i32 5}
!2 = !{i32 0, i32 52, i32 77426347, !"_Z9ftemplateIiET_i", i32 444, i32 11, i32 6}
!3 = !{i32 0, i32 52, i32 77426347, !"_Z3fooi", i32 99, i32 11, i32 0}
!4 = !{i32 0, i32 52, i32 77426347, !"_Z3fooi", i32 272, i32 11, i32 3}
!5 = !{i32 0, i32 52, i32 77426347, !"_Z3fooi", i32 127, i32 11, i32 1}
!6 = !{i32 0, i32 52, i32 77426347, !"_Z3fooi", i32 159, i32 11, i32 2}

The fields in each metadata entry are (in sequence):
Entry 1) an ID of the type of metadata - right now only zero is used meaning "OpenMP target region".
Entry 2) a unique ID of the device where the input source file that contain the target region lives.
Entry 3) a unique ID of the file where the input source file that contain the target region lives.
Entry 4) a mangled name of the function that encloses the target region.
Entries 5) and 6) line and column number where the target region was found.
Entry 7) is the order the entry was emitted.

Entry 2) and 3) are required to distinguish files that have the same function name.
Entry 4) is required to distinguish different instances of the same declaration (usually templated ones)
Entries 5) and 6) are required to distinguish the particular target region in body of the function (it is possible that a given target region is not an entry point - if clause can evaluate always to zero - and therefore we need to identify the "interesting" target regions. )

This patch replaces http://reviews.llvm.org/D12306.

Diff Detail

Event Timeline

sfantao updated this revision to Diff 33971.Sep 3 2015, 4:16 PM
sfantao retitled this revision from to [OpenMP] Offloading descriptor registration and device codegen..
sfantao updated this object.
sfantao added reviewers: ABataev, rjmccall, hfinkel, tra.
sfantao updated this object.
sfantao added a subscriber: cfe-commits.
sfantao updated this object.Sep 3 2015, 4:18 PM
sfantao updated this revision to Diff 34955.Sep 16 2015, 6:03 PM
sfantao updated this object.

Rebase on top of last changes in http://reviews.llvm.org/D12871.

sfantao updated this revision to Diff 36320.Oct 1 2015, 4:18 PM

Rebase on top of the last changes in http://reviews.llvm.org/D12871.

sfantao updated this revision to Diff 36410.Oct 2 2015, 4:45 PM

Fix bug for when no offloading triples are specified.

sfantao updated this revision to Diff 36812.Oct 7 2015, 5:18 PM

Rebase and fix typo in regression test directive.

ABataev added inline comments.Oct 7 2015, 8:27 PM
lib/CodeGen/CGOpenMPRuntime.h
323

Maybe it is better to make it a class if it has some non-public members?

sfantao updated this object.Oct 8 2015, 8:49 AM
sfantao updated this revision to Diff 36880.Oct 8 2015, 11:19 AM

Use class instead of structs if aggregate have private or protected fields.

Thanks for the comments!

lib/CodeGen/CGOpenMPRuntime.h
323

Ok, replaced struct by class in OffloadEntriesInfoManagerTy and the other subtypes that have either private or protected fields.

ABataev added inline comments.Oct 14 2015, 9:51 PM
lib/CodeGen/CGOpenMPRuntime.cpp
1962

~0u

2349

I think there should be 4-bytes padding between NumDevices and DeviceImages fields in 64-bit mode, right? It is better to create this structure as clang AST RecordDecl/CXXRecordDecl and then use CGM.getTypes().ConvertTypeForMem().

lib/CodeGen/CGOpenMPRuntime.h
325–328

I think these members must be private.

339

Better ~0

343

Make it private/protected also

345

explicit, ~0u

367–370

Private/protected

372

explicit

373

~0u

918–928

Do we really need this stuff? CodeGenFunction has member CurFuncDecl, which, it seems, could give you required info without these functions.

sfantao updated this revision to Diff 37802.Oct 19 2015, 3:18 PM

Use GlobalDecl to forward information about the name of OpenMP region's enclosing function to the OpenMP outlined functions. This replaces the initial implementation that was using a stack to keep this information.

Add regression test that checks that the target region name mangling is correct if enclosed in a lambda function. I had to add extra logic in the scanning of the target regions because lambda function are emitted as global definitions.

Other minor changes to address Alexey's comments.

lib/CodeGen/CGOpenMPRuntime.cpp
1962

Done.

2349

Done!

lib/CodeGen/CGOpenMPRuntime.h
325–328

Done! Also added some setters and getters for the privatized fields.

339

Done!

343

Done!

345

Done

367–370

Done

372

Done

373

Done

918–928

I can't rely on CurFuncDecl because the parent function can be in some cases an implicit outlined function, and what I need is the enclosing user function.

In the new diff, I implemented this in a slightly different way: I forward the user function GlobalDecl to the implicit functions (see GenerateOpenMPCapturedStmtFunction).

In order for this to work I had to add special login in the scanning of the target regions to deal with lambda functions given that they are also emitted as global definitions.

Hope you like this approach better.

rjmccall edited edge metadata.Oct 21 2015, 11:34 AM

CurFuncDecl is supposed to be the enclosing user function. Things like outlined functions should be getting stored in CurCodeDecl; that's how it's done for blocks and lambdas.

Hi John,

Thanks for the remark!

CurFuncDecl is supposed to be the enclosing user function. Things like outlined functions should be getting stored in CurCodeDecl; that's how it's done for blocks and lambdas.

Apologies I was not accurate in my previous post. CurFuncDecl is in fact the declaration of the enclosing user function. What is not defined in some times undefined is CurGD and this is what I was trying to use to get the right mangled name of the user function, given that it also encodes the structor type. So my question is: is there a good/safe way to get the mangled name of the user function given the function declaration? I didn't find any good way to do that without replicating part of the stuff that happens in the mangler.

Just a little bit of context. The reason I was relying on the mangled name (along with source information) to unequivocally identify a target region is that it seemed as the most straightforward way to differentiate between different instances of the same template function/aggregate. So if you are aware of a different/easier way to accomplish the same goal please let me know.

Thanks again!
Samuel

Hi John,

Thanks for the remark!

CurFuncDecl is supposed to be the enclosing user function. Things like outlined functions should be getting stored in CurCodeDecl; that's how it's done for blocks and lambdas.

Apologies I was not accurate in my previous post. CurFuncDecl is in fact the declaration of the enclosing user function. What is not defined in some times undefined is CurGD and this is what I was trying to use to get the right mangled name of the user function, given that it also encodes the structor type. So my question is: is there a good/safe way to get the mangled name of the user function given the function declaration? I didn't find any good way to do that without replicating part of the stuff that happens in the mangler.

You don't actually want the structor type of the parent, because the nested declaration is logically the same declaration across all of them. For example, a lambda used in a constructor is still just a single type; there aren't implicitly 1-3 different types just because there are 1-3 different variant entrypoints for the constructor.

The way this generally works is that you just pick a single canonical variant. For example, the Itanium ABI says that you mangle local entities within constructors as if they were defined within the complete-object variant. If you want to add a method to one of the CXXABI objects to pick a canonical GD for a declaration, feel free.

sfantao updated this revision to Diff 39588.Nov 6 2015, 1:53 PM
sfantao edited edge metadata.

Use CurFuncDecl to generate offload kernel names as suggested by John McCall.

Hi John,

Thanks for the remark!

CurFuncDecl is supposed to be the enclosing user function. Things like outlined functions should be getting stored in CurCodeDecl; that's how it's done for blocks and lambdas.

Apologies I was not accurate in my previous post. CurFuncDecl is in fact the declaration of the enclosing user function. What is not defined in some times undefined is CurGD and this is what I was trying to use to get the right mangled name of the user function, given that it also encodes the structor type. So my question is: is there a good/safe way to get the mangled name of the user function given the function declaration? I didn't find any good way to do that without replicating part of the stuff that happens in the mangler.

You don't actually want the structor type of the parent, because the nested declaration is logically the same declaration across all of them. For example, a lambda used in a constructor is still just a single type; there aren't implicitly 1-3 different types just because there are 1-3 different variant entrypoints for the constructor.

The way this generally works is that you just pick a single canonical variant. For example, the Itanium ABI says that you mangle local entities within constructors as if they were defined within the complete-object variant. If you want to add a method to one of the CXXABI objects to pick a canonical GD for a declaration, feel free.

Thanks for explaining that!

I am now relying exclusively on CurFuncDecl in the last diff.

As for the structor variants, I am now using the complete variant to generate the names of the kernels as you suggested. I didn't add any method to CXXABI as that will require extra logic in ASTContext to make that visible during the code generation. Instead, I hardcoded Ctor[Dtor]_Complete in the code generation, similarly to what is done in the name mangler. Let me know if you'd rather have the method in CXXABI.

Let me know other comments suggestions you may have.

Thanks again!
Samuel

As for the structor variants, I am now using the complete variant to generate the names of the kernels as you suggested. I didn't add any method to CXXABI as that will require extra logic in ASTContext to make that visible during the code generation. Instead, I hardcoded Ctor[Dtor]_Complete in the code generation, similarly to what is done in the name mangler. Let me know if you'd rather have the method in CXXABI.

I think just using the complete variant is a totally reasonable choice.

I'll try to take a look at the rest of the patch soon.

John.

Any more comments on this patch?

Thanks!
Samuel

sfantao updated this revision to Diff 41670.Dec 2 2015, 1:44 PM
  • Fix tests to use the new captures types that OpenMP 4.5 requires.
  • Rebase.

Are there any more comments for this patch?

Thanks!
Samuel

Any more comments on this patch?

Thanks!
Samuel

Is anyone feeling responsible for a final go / review? It would be great for our research project to have at least basic offloading support in 3.8...

I'm sorry for the delay. It looks fine, thanks.

sfantao accepted this revision.Jan 5 2016, 8:25 AM
sfantao added a reviewer: sfantao.
This revision is now accepted and ready to land.Jan 5 2016, 8:25 AM
sfantao closed this revision.Jan 5 2016, 8:26 AM