This is an archive of the discontinued LLVM Phabricator instance.

[OpenCL] Split opencl-c.h header
AbandonedPublic

Authored by asavonic on Aug 31 2018, 8:49 AM.

Details

Reviewers
Anastasia
yaxunl
Summary

TL;DR

This patch splits huge opencl-c.h header into multiple headers to
support efficient use of Precompiled Headers (or Modules):

  • opencl-c-defs.h contains all preprocessor macros. Macros are not saved in PCHs.
  • opencl-c-common.h contains builtins which do not depend on any preprocessor macro (extensions, OpenCL C version).
  • opencl-c-fp16.h and opencl-c-fp64.h contain builtins which require either cl_khr_fp16 or cl_khr_fp64 macros, but they do not depend on other extensions and OpenCL C version.
  • opencl-c-platform.h (looking for a better name! 'target' maybe?) contains all other builtins which have more complicated requirements.

Umbrella header opencl-c.h includes all headers above, so this change
is backward compatible with the original design.

Details

Typical OpenCL compiler implicitly includes opencl-c.h before
compiling user code:

#include "opencl-c.h"
__kernel void k() { printf("hello world\n"); }

With this approach, even for tiny programs compiler spends most of its
time on parsing of opencl-c.h header (which has more than 16000 LOC),
so it takes ~1s to compile our hello world example into LLVM IR.

Obvious solution for this problem is to compile opencl-c.h into an AST
only once, and then use this AST to compile user code. This feature
can be implemented using Precompiled Headers or Clang Modules, but it
has one major drawback: AST must be built with the same set of
preprocessor macros and the same target triple as the the user
code. Consider the following example:

opencl-c.h:

#if __OPENCL_C_VERSION__ >= CL_VERSION_2_0
  float fract(float x, float *iptr);
#else
  float fract(float x, __global float *iptr);
#endif

If we compile this opencl-c.h into an AST, only one of these two
functions will be present. We cannot use this AST to compile for both
OpenCL C 1.2 and 2.0.

Another example: if we compile opencl-c.h into an AST with
spir-unknown-unknown (32bit) target triple, AST will have 'int'
instead of 'size_t':

opencl-c.h:

typedef __SIZE_TYPE__ size_t; // __SIZE_TYPE__ is defined to int by clang
size_t get_global_size(uint dimindx);

This makes the AST non-portable, and it cannot be used to compile for
spir64 triple (where size_t is 64bit integer).

If we want compiler to support CL1.2/CL2.0 for spir/spir64 triple,
we'll need to use 4 different PCHs compiled with different flags:

-cl-std=CL1.2 + -triple spir
-cl-std=CL1.2 + -triple spir64
-cl-std=CL2.0 + -triple spir
-cl-std=CL2.0 + -triple spir64

Things are getting worse if we want to support multiple devices, which
have different sets of OpenCL extensions. For example, if we want to
support both CPU (which supports cl_khr_fp64) and GPU (which supports
cl_khr_fp16), we must compile opencl-c.h using different options,
because it has the following code all over the place:

#ifdef cl_khr_fp64
  uchar __ovld __cnfn convert_uchar(double);
#endif
#ifdef cl_khr_fp16
  uchar __ovld __cnfn convert_uchar(half);
#endif

So if we want to add these 2 devices, we now need 4*2 different PCHs,
since every combination of -cl-std and -triple must be compiled twice
with different OpenCL extensions defines.

Size of each PCH is 2.5M, so we need ~20M of memory to store our
PCHs. If we want to add more devices or support another OpenCL C
version, the size will double.

For a C/C++ compiler this is not a problem: clang maintains a cache so
that only combinations which are actually used are saved to a
disk. OpenCL compilers are different, because they are often
distributed as a shared library. Being a library, it is not expected
to write something on a disk, because it rises a number of questions,
such as: "where do we have a writable temporary directory?" and "how
much space we can consume for PCHs?". For some targets (say, embedded
devices) it is better to avoid this approach.

To solve this problem, the patch splits opencl-c.h header into
multiple headers.

First, we split out 'common' functions, which should be present
regardless of OpenCL C version or extensions. All builtins which have
'core' types, such as convert, math and simple image builtins are
moved into opencl-c-common.h (8K LOC).

Second, split out all functions which are 'common' between OpenCL
version, but require either cl_khr_fp16 or cl_khr_fp64 (not both at
the same time!). Two headers (opencl-c-fp16.h and opencl-c-fp64.h)
have 2K LOC each.

Everything else goes into opencl-c-platform.h (5K LOC).

All macros go in opencl-c-defs.h, because macros cannot be
pre-compiled into AST.

With this setup, we can compile opencl-c-common.h, opencl-c-fp16.h and
opencl-c-fp64.h into PCHs with one set of extensions/OpenCL version,
and use them for any other set of extensions/OpenCL version. Clang
will detect this and throw out an error, which can be safely disabled
by -fno-validate-pch option.

opencl-c-platform.h (5K LOC) must still be pre-compiled for each
supported combination, but since it is a lot smaller (~0.5M vs
original 2.5M), it is not that bad. Or we can sacrifice some
performance and leave this header without pre-compilation: large
portion of opencl-c-platform.h contains vendor extensions, so it will
be removed by preprocessor anyway.

Diff Detail

Event Timeline

asavonic created this revision.Aug 31 2018, 8:49 AM
krisb added a subscriber: krisb.Sep 3 2018, 1:33 AM
svenvh added a subscriber: svenvh.Sep 3 2018, 8:16 AM
svenvh added inline comments.Sep 4 2018, 3:05 AM
lib/Headers/opencl-c-common.h
9

Would it be worth having a brief explanation here about what is supposed to go in this file? Although that merely repeats what's already in opencl-c.h of course.

Or perhaps you could make the description on line 1 more specific to each file.

lib/Headers/opencl-c.h
23

nitpick: double "which"

It seems generally good to partition this big header but I am trying to understand what problem is it trying to solve now? The unsupported declarations are guarded out by #if defined(ext_name) and therefore won't be parsed and put into PCH is extensions are not supported by some target. So I guess it can save the space when installing the headers because we don't need to copy all of them into the installation? Although this is not implemented currently.

It seems generally good to partition this big header but I am trying to understand what problem is it trying to solve now?

Main motivation is to reduce memory footprint by factoring out everything that is 'common' between PCHs compiled for different targets (CL1.2/2.0 and spir/spir64).

So if we want to add these 2 devices, we now need 4*2 different PCHs,
since every combination of -cl-std and -triple must be compiled twice
with different OpenCL extensions defines.

Size of each PCH is 2.5M, so we need ~20M of memory to store our
PCHs. If we want to add more devices or support another OpenCL C
version, the size will double.

This also allows to do 'partial' pre-compilation, where 'common' part is pre-compiled into a single target independent PCH, and plain header is used for all target-specific things.
In this case, you no longer need to maintain different target-specific PCHs and this greatly simplifies the whole process.

opencl-c-platform.h (5K LOC) must still be pre-compiled for each
supported combination, but since it is a lot smaller (~0.5M vs
original 2.5M), it is not that bad. Or we can sacrifice some
performance and leave this header without pre-compilation: large
portion of opencl-c-platform.h contains vendor extensions, so it will
be removed by preprocessor anyway.

It seems generally good to partition this big header but I am trying to understand what problem is it trying to solve now?

Main motivation is to reduce memory footprint by factoring out everything that is 'common' between PCHs compiled for different targets (CL1.2/2.0 and spir/spir64).

So if we want to add these 2 devices, we now need 4*2 different PCHs,
since every combination of -cl-std and -triple must be compiled twice
with different OpenCL extensions defines.

Size of each PCH is 2.5M, so we need ~20M of memory to store our
PCHs. If we want to add more devices or support another OpenCL C
version, the size will double.

This also allows to do 'partial' pre-compilation, where 'common' part is pre-compiled into a single target independent PCH, and plain header is used for all target-specific things.
In this case, you no longer need to maintain different target-specific PCHs and this greatly simplifies the whole process.

opencl-c-platform.h (5K LOC) must still be pre-compiled for each
supported combination, but since it is a lot smaller (~0.5M vs
original 2.5M), it is not that bad. Or we can sacrifice some
performance and leave this header without pre-compilation: large
portion of opencl-c-platform.h contains vendor extensions, so it will
be removed by preprocessor anyway.

Currently the main header still contains everything, so the size of the PCH won't change. So I was wondering what would be the way to avoid this... other than adding explicit #include.

Currently the main header still contains everything, so the size of the PCH won't change.

The idea is that we don't pre-compile the whole opencl-c.h, we split
it into several headers (3 of them are target independent) and
pre-compile them instead.

With this approach, we can reuse target-independent PCHs (common,
fp16, fp64) and only duplicate target-specific PCHs if needed
(opencl-c-platform.h).

The idea is basically:

  1. Compile target-independent headers into modules:
    • opencl-c-common.h -> opencl-c-common.pcm
    • opencl-c-fp16.h -> opencl-c-fp16.pcm
    • opencl-c-fp64.h -> opencl-c-fp64.pcm
  1. Implicitly include opencl-c.h (plain header), which has the following content:
#include "opencl-c-common.h"
#if cl_khr_fp16
#include "opencl-c-fp16.h"
#endif
#if cl_khr_fp64
#include "opencl-c-fp16.h"
#endif
#include "opencl-c-platform.h"

When compiler reaches an #include statement in opencl-c.h, it loads a
corresponding PCH. Headers that were not pre-compiled are included as
usual.

Currently the main header still contains everything, so the size of the PCH won't change.

The idea is that we don't pre-compile the whole opencl-c.h, we split
it into several headers (3 of them are target independent) and
pre-compile them instead.

With this approach, we can reuse target-independent PCHs (common,
fp16, fp64) and only duplicate target-specific PCHs if needed
(opencl-c-platform.h).

The idea is basically:

  1. Compile target-independent headers into modules:
    • opencl-c-common.h -> opencl-c-common.pcm
    • opencl-c-fp16.h -> opencl-c-fp16.pcm
    • opencl-c-fp64.h -> opencl-c-fp64.pcm
  1. Implicitly include opencl-c.h (plain header), which has the following content:
#include "opencl-c-common.h"
#if cl_khr_fp16
#include "opencl-c-fp16.h"
#endif
#if cl_khr_fp64
#include "opencl-c-fp16.h"
#endif
#include "opencl-c-platform.h"

When compiler reaches an #include statement in opencl-c.h, it loads a
corresponding PCH. Headers that were not pre-compiled are included as
usual.

Ok, I see. It seems reasonable. I am not sure if opencl-c-platform is the right name, since it seems some of the bits are just version/extension dependent there and not the platform? But perhaps if you add some description with guidelines on what should be put in each header, it should be fine.

With this setup, we can compile opencl-c-common.h, opencl-c-fp16.h and
opencl-c-fp64.h into PCHs with one set of extensions/OpenCL version,
and use them for any other set of extensions/OpenCL version. Clang
will detect this and throw out an error, which can be safely disabled
by -fno-validate-pch option.

However, keeping this as a permanent solution is unsafe. Because this way can result in unexpected errors to be silent out and allow erroneous configurations to be accepted successfully without any notification. So I am wondering if there is any plan to put a proper solution in place at some point?

With this setup, we can compile opencl-c-common.h, opencl-c-fp16.h and
opencl-c-fp64.h into PCHs with one set of extensions/OpenCL version,
and use them for any other set of extensions/OpenCL version. Clang
will detect this and throw out an error, which can be safely disabled
by -fno-validate-pch option.

However, keeping this as a permanent solution is unsafe. Because this way can result in unexpected errors to be silent out and allow erroneous configurations to be accepted successfully without any notification.

Agree. LIT test in test/Headers/opencl-c-header-split.cl is supposed
to verify that common/fp16/fp64 headers do not use preprocessor macros
and it should catch most of the issues, but this is definitely not the
most robust solution.

So I am wondering if there is any plan to put a proper solution in place at some point?

This solution can be improved if we implement convert and shuffle
as clang builtins with custom typechecking: these two builtins (with
all their overloadings) take ~90% of opencl-c-common.h and ~50% of
fp16/fp64 headers.

However, this can be a non-trivial change: it is difficult to do a
proper mangling for clang builtins and rounding modes must be handled
as well.

I'll take a few days to prototype this. If it turns out to be good in
terms of performance/footprint, we can drop this patch.

With this setup, we can compile opencl-c-common.h, opencl-c-fp16.h and
opencl-c-fp64.h into PCHs with one set of extensions/OpenCL version,
and use them for any other set of extensions/OpenCL version. Clang
will detect this and throw out an error, which can be safely disabled
by -fno-validate-pch option.

However, keeping this as a permanent solution is unsafe. Because this way can result in unexpected errors to be silent out and allow erroneous configurations to be accepted successfully without any notification.

Agree. LIT test in test/Headers/opencl-c-header-split.cl is supposed
to verify that common/fp16/fp64 headers do not use preprocessor macros
and it should catch most of the issues, but this is definitely not the
most robust solution.

So I am wondering if there is any plan to put a proper solution in place at some point?

This solution can be improved if we implement convert and shuffle
as clang builtins with custom typechecking: these two builtins (with
all their overloadings) take ~90% of opencl-c-common.h and ~50% of
fp16/fp64 headers.

However, this can be a non-trivial change: it is difficult to do a
proper mangling for clang builtins and rounding modes must be handled
as well.

I'll take a few days to prototype this. If it turns out to be good in
terms of performance/footprint, we can drop this patch.

Btw, I was wondering if a target agnostic PCH is a viable solution to move forward too. Something that we discussed/prototyped during the HackerLab in EuroLLVM?

bader removed a reviewer: bader.May 18 2021, 6:12 AM
asavonic abandoned this revision.Nov 9 2021, 2:22 AM
lib/Headers/opencl-c-fp64.h