Page MenuHomePhabricator

yaxunl (Yaxun Liu)
User

Projects

User does not belong to any projects.

User Details

User Since
May 13 2015, 10:16 AM (317 w, 3 d)

Recent Activity

Yesterday

yaxunl accepted D104062: [HIP] Fix --hip-version flag with 0 as component.

LGTM. Thanks.

Fri, Jun 11, 9:18 AM · Restricted Project

Thu, Jun 10

yaxunl added inline comments to D104062: [HIP] Fix --hip-version flag with 0 as component.
Thu, Jun 10, 8:02 PM · Restricted Project
yaxunl added a comment to D101630: [HIP] Add --gpu-bundle-output.

got test failures on some bots. fixed by 734213d7b51f9ea22a9d122c0646ca5b69f88ac8

Thu, Jun 10, 6:15 AM · Restricted Project
yaxunl committed rG734213d7b51f: Fix test hip-device-compile.hip (authored by yaxunl).
Fix test hip-device-compile.hip
Thu, Jun 10, 6:14 AM

Wed, Jun 9

yaxunl committed rG5fc2673fbce2: [HIP] Add --gpu-bundle-output (authored by yaxunl).
[HIP] Add --gpu-bundle-output
Wed, Jun 9, 8:55 PM
yaxunl closed D101630: [HIP] Add --gpu-bundle-output.
Wed, Jun 9, 8:54 PM · Restricted Project

Tue, Jun 8

yaxunl accepted D103920: [amdgpu] Add `-enable-ocl-mangling-mismatch-workaround`..

LGTM. Thanks!

Tue, Jun 8, 12:24 PM · Restricted Project
yaxunl committed rG054cc3b1b469: [CUDA][HIP] Fix store of vtbl in ctor (authored by yaxunl).
[CUDA][HIP] Fix store of vtbl in ctor
Tue, Jun 8, 7:25 AM
yaxunl closed D103835: [CUDA][HIP] Fix store of vtbl in ctor.
Tue, Jun 8, 7:25 AM · Restricted Project

Mon, Jun 7

yaxunl added inline comments to D101630: [HIP] Add --gpu-bundle-output.
Mon, Jun 7, 12:49 PM · Restricted Project
yaxunl requested review of D103835: [CUDA][HIP] Fix store of vtbl in ctor.
Mon, Jun 7, 12:09 PM · Restricted Project
yaxunl updated the diff for D101630: [HIP] Add --gpu-bundle-output.

use --gpu-bundle-output to control bundling/unbundling output of HIP device only compilation

Mon, Jun 7, 9:09 AM · Restricted Project
yaxunl added a comment to D101777: [clang] p1099 1/5: [NFC] Break out BaseUsingDecl from UsingDecl.

It seems this patch caused build failure:

Mon, Jun 7, 7:19 AM · Restricted Project, Restricted Project

Fri, Jun 4

yaxunl added inline comments to D103108: [CUDA][HIP] Promote const variables to constant.
Fri, Jun 4, 3:55 PM · Restricted Project
yaxunl added a comment to D101630: [HIP] Add --gpu-bundle-output.

But how do we control emitting LLVM IR with or without bundle? -emit-llvm -emit-gpu-object or -emit-llvm -emit-gpu-bundle? -emit-* is usually for specifying a specific file type.

Hmm. I forgot that HIP can bundle things other than objects. -emit-llvm -emit-gpu-bundle looks reasonable, but -emit-llvm -emit-gpu-object is indeed odd.
OK. Making it some sort of do/do-not bundle flag makes sense. How about just --[no-]gpu-bundle-output?

Fri, Jun 4, 10:49 AM · Restricted Project
yaxunl added a comment to D101630: [HIP] Add --gpu-bundle-output.

For sure we will need -fgpu-bundle-device-output to control bundling of intermediate files. Then adding -emit-gpu-object and -emit-gpu-bundle may be redundant and can cause confusion. What if users specify -c -fgpu-bundle-device-output -emit-gpu-object or -c -fno-gpu-bundle-device-output -emit-gpu-bundle? To me a single option -fgpu-bundle-device-output to control all device output seems cleaner.

The idea is to use -emit-gpu-object and -emit-gpu-bundle instead of the -f[no-]gpu-bundle-device-output. Otherwise they'd do exactly the same thing.

I think -emit-gpu-{object,bundle} has a minor edge over -f[no-]gpu-bundle-device-output as it's similar to other -emit options for controlling clang compilation phases (and that's what we want to do here), while -f options are usually for tweaking code generation.

Fri, Jun 4, 10:05 AM · Restricted Project
yaxunl added a comment to D101630: [HIP] Add --gpu-bundle-output.

I think for intermediate outputs e.g. preprocessor expansion, IR, and assembly, probably it makes sense not to bundle by default.

Agreed.

However, for default action (emitting object), we need to bundle by default since it was the old behavior and existing HIP apps depend on that.

Existing use is a valid point.
As a counterargument, I would suggest that in a compilation pipeline which does include bundling, an object file for one GPU variant *is* an intermediate output, similar to the ones you've listed above.

The final product of device-side subcompilations is a bundle. The question is what does "-c" mean?. Is it produce an object file or compile till the end of the pipeline ?
For CUDA and HIP compilation it's ambiguous. When we target just one GPU, it would be closer to the former. In general, it would be closer to the latter. NVCC side-steps the issue by using a different flags -cubin/-fatbin to disambiguate between two cases and avoid bolting on CUDA-related semantics on the compiler flags that were not designed for that.

Then we allow -fhip-bundle-device-output to override the default behavior.

OK. Bundling objects for HIP by default looks like a reasonable compromise.
It would be useful to generalize the flag to -fgpu-bundle... as it would be useful if/when we want to produce a fatbin during CUDA compilation. I'd still keep no-bundling as the default for CUDA's objects.

Now that we are in agreement of what we want, the next question is *how* we want to do it.

It appears that there's a fair bit of similarity between what the proposed -fgpu-bundle flag does and the handful of --emit-... options clang has now.
If we were to use something like --emit-gpu-object and --emit-gpu-bundle, it would be similar to NVCC's -cubin/-fatbinary, would decouple the default behavior for -c --cuda-device-only from the user's ability to specify what they want without burdening -c with additional flags that would have different defaults under different circumstances.

Compilation with "-c" would remain the "compile till the end", whatever it happens to mean for particular language and --emit-object/bundle would tell the compiler how far we want it to proceed and what kind of output we want. This would probably be easier to explain to the users as they are already familiar with flags like -emit-llvm, only now we are dealing with an extra bundling step in the compilation pipeline. It would also behave consistently across CUDA and HIP even though they have different defaults for bundling for the device-side compilation. E.g. -c --cuda-device-only --emit-gpu-bundle will always produce a bundle with the object files for both CUDA and HIP and -c --cuda-device-only --emit-gpu-object will always require single '-o' output.

WDYT? Does it make sense?

Fri, Jun 4, 6:36 AM · Restricted Project
yaxunl committed rGb5dea8701ba9: [HIP] Fix spack HIP device lib detection (authored by yaxunl).
[HIP] Fix spack HIP device lib detection
Fri, Jun 4, 6:13 AM
yaxunl closed D103281: [HIP] Fix spack HIP device lib detection.
Fri, Jun 4, 6:13 AM · Restricted Project

Thu, Jun 3

yaxunl added a comment to D103563: [HIP] Fix amdgcn builtin for long type.

If you want a way to say "some 64-bit type" in the builtins file, we could add that.

Thu, Jun 3, 8:12 PM · Restricted Project
yaxunl accepted D103658: CUDA/HIP: Change device-use-host-var.cu's NOT "external" check to include "addrspace".

LGTM. thanks

Thu, Jun 3, 5:45 PM · Restricted Project
yaxunl committed rGe42def62d8d9: [HIP] Fix amdgcn builtin for long type (authored by yaxunl).
[HIP] Fix amdgcn builtin for long type
Thu, Jun 3, 4:06 PM
yaxunl closed D103563: [HIP] Fix amdgcn builtin for long type.
Thu, Jun 3, 4:06 PM · Restricted Project
yaxunl accepted D103579: [LTO] Fix -fwhole-program-vtables handling after HIP ThinLTO patch.

LGTM. Thanks!

Thu, Jun 3, 1:21 PM · Restricted Project
yaxunl added a comment to D103563: [HIP] Fix amdgcn builtin for long type.

I got regressions in OpenCL tests since long long int is 128 bit in OpenCL. I switched to use W as width modifier in builtin definitions since W is for fixed-length 64 bit int type.

Thu, Jun 3, 8:56 AM · Restricted Project
yaxunl updated the diff for D103563: [HIP] Fix amdgcn builtin for long type.

fixed regressions with OpenCL. Use fixed length 64 bit int type instead of long long type

Thu, Jun 3, 8:54 AM · Restricted Project
yaxunl added a comment to D103281: [HIP] Fix spack HIP device lib detection.

Accepting this, with the comment that I think it's better to follow up with a patch that reverts autodetecting hip from clang based on the spack directory structure.

Thu, Jun 3, 7:29 AM · Restricted Project

Wed, Jun 2

yaxunl requested review of D103563: [HIP] Fix amdgcn builtin for long type.
Wed, Jun 2, 3:31 PM · Restricted Project
yaxunl committed rG61c65d8e4a29: Fix comments in test cuda-kernel-call.cu (authored by yaxunl).
Fix comments in test cuda-kernel-call.cu
Wed, Jun 2, 7:22 AM

Tue, Jun 1

yaxunl committed rG04caa7c3e02f: [CUDA][HIP] Promote const variables to constant (authored by yaxunl).
[CUDA][HIP] Promote const variables to constant
Tue, Jun 1, 6:29 PM
yaxunl closed D103108: [CUDA][HIP] Promote const variables to constant.
Tue, Jun 1, 6:29 PM · Restricted Project
yaxunl committed rGf7e87dd6ff0c: [CUDA][HIP] Change default lang std to c++14 (authored by yaxunl).
[CUDA][HIP] Change default lang std to c++14
Tue, Jun 1, 6:09 PM
yaxunl closed D103221: [CUDA][HIP] Change default lang std to c++14.
Tue, Jun 1, 6:09 PM · Restricted Project
yaxunl added a comment to D101630: [HIP] Add --gpu-bundle-output.

How does nvcc --genco behave when there are multiple GPU arch's? Does it output a fat binary containing multiple ISA's? Also, does it support device-only compilation for intermediate outputs?

It does not allow multiple outputs for -ptx and -cubin compilations, same as clang behaves now:

$ ~/local/cuda-11.3/bin/nvcc -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_70,code=sm_70 -ptx foo.cu
nvcc fatal   : Option '--ptx (-ptx)' is not allowed when compiling for multiple GPU architectures

NVCC does allow -E with multiple targets, but it does produce output for only *one* of them.

NVCC does bundle outputs for multiple GPU variants if -fatbin is used.

Tue, Jun 1, 1:21 PM · Restricted Project
yaxunl added inline comments to D103221: [CUDA][HIP] Change default lang std to c++14.
Tue, Jun 1, 1:12 PM · Restricted Project
yaxunl updated the diff for D103221: [CUDA][HIP] Change default lang std to c++14.

fix test

Tue, Jun 1, 1:12 PM · Restricted Project
yaxunl added a comment to D103108: [CUDA][HIP] Promote const variables to constant.

ping

Tue, Jun 1, 12:17 PM · Restricted Project
yaxunl added inline comments to D103221: [CUDA][HIP] Change default lang std to c++14.
Tue, Jun 1, 12:17 PM · Restricted Project
yaxunl updated the diff for D103221: [CUDA][HIP] Change default lang std to c++14.

revised by Artem's comments, and fix tests

Tue, Jun 1, 12:15 PM · Restricted Project

Mon, May 31

yaxunl added a comment to D102723: [HIP] Tighten checks in hip-include-path.hip test case.

It seems we cannot introduce ROOT by line 19 since it is not used in other lines in situations where working directories have sym links.

Instead, we just change {{.*}}clang/ to {{.*}}/lib/clang/ in other lines. Hopefully this will still work.

Ok. That probably works.

A bit surprised though. I found checks using InstalledDir in other test cases such as clang/test/Friver/stdlibxx-isystem.cpp, so I figured it would be safe to use that also in this test case. So I wonder how other tests that captures InstalledDir could work for the symlink-bots.

Mon, May 31, 9:55 AM · Restricted Project
yaxunl added a comment to D102723: [HIP] Tighten checks in hip-include-path.hip test case.

It seems we cannot introduce ROOT by line 19 since it is not used in other lines in situations where working directories have sym links.

Mon, May 31, 8:01 AM · Restricted Project

Fri, May 28

yaxunl added a comment to D101630: [HIP] Add --gpu-bundle-output.

How about this:
If the user explicitly specified --cuda-host-only or --cuda-device-only, then by default only allow producing the natural output format, unless a bundled output is requested by an option. This should keep existing users working.
If the compilation is done without explicitly requested sub-compilation(s), then bundle the output by default. This should keep the GPU-unaware tools like ccache happy as they would always get the single output they expect.

WDYT?

--cuda-host-only always have one output, therefore there is no point of bundle its output. We only need to decide the proper behavior of --cuda-device-only.

It still fits my proposal of requiring a single sub-compilation and not bundling the output.
The point was that such behavior is consistent regardless of whether we're compiling CUDA or HIP for the host or for device.

How about keeping the original default behavior of not bundling if users do not specify output file,
whereas bundle the output if users specifying output file.

I think it will make things worse. Compiler output should not change depending on whether -o is used.

Since specifying output file indicates users requesting one output. -f[no-]hip-bundle-device-output override the default behavior.

I disagree. When user specifies the output, the intent is to specify the location of the outputs, not its contents or format.

Telling compiler to produce a different output format should not depend on specifying (or not) the output location.

I think our options are:

  • Always bundle --cuda-device-only outputs by default. This is consistent for HIP compilation, but deviates from CUDA, which can't do bundling. Also, single-target subcompilation deviates from both CUDA and regular C++ compilation, which is what most users would be familiar with and which would probably be the most sensible default for a single sub-compilation. It can be overridden with an option, but it goes against the principle that it's specialized use case that should need extra options. The most common use case should not need them.
  • Only bundle multiple sub-compilations' output by default. This would preserve the sensible single sub-compilation behavior. The downside is that it makes the output format depend on whether compiler ends up doing one or many sub-compilations. E.g. --offload-arch=A -S would produce ASM and --offload-arch=A --offload-arch=B -S would produce a bundle. If the user can't control some of the compiler options, Such approach would make output format unpredictable. E.g. passing --offload-arch=foo to compiler on godbolt would all of a sudden produce bundled output instead of assembly text or a sensible error message that you're trying to produce multiple outputs.
  • Keep the current behavior (insist on single sub-compilation) as the default, allow overriding it for HIP with the flag. IMO that's the most consistent option and I still think it's the one most suitable to keep as the default.

I can see the benefit of always bundling for HIP, but I also believe that keeping things simple, consistent and predictable is important. Considering that we're tinkering in a relatively obscure niche of the compiler, it probably does not matter all that much, but it should not stop us from trying to figure out the best approach in a principled way.

I think we could benefit from a second opinion on which approach would make more sense for clang.
Summoning @jdoerfert and @echristo.

Fri, May 28, 2:16 PM · Restricted Project

Thu, May 27

yaxunl added a comment to D97340: [HIP] Support Spack packages.

opened https://reviews.llvm.org/D103281 to fix device lib detection for spack

Thu, May 27, 1:01 PM · Restricted Project
yaxunl requested review of D103281: [HIP] Fix spack HIP device lib detection.
Thu, May 27, 12:59 PM · Restricted Project
yaxunl committed rG6d2c0950205f: [HIP] Check compatibility of -fgpu-sanitize with offload arch (authored by yaxunl).
[HIP] Check compatibility of -fgpu-sanitize with offload arch
Thu, May 27, 9:07 AM
yaxunl closed D102975: [HIP] Check compatibility of -fgpu-sanitize with offload arch.
Thu, May 27, 9:07 AM · Restricted Project
yaxunl added a comment to D97340: [HIP] Support Spack packages.

Hi Yaxunl,

The patch should not cause circular dependency on HIP or device library.

I'm not saying this patch introduces a circular dependency, I'm saying you are trying to solve an already existing circular dependency (clang needs device libs at runtime, but device libs need llvm to compile).

Let's talk about my PR here: https://github.com/spack/spack/pull/23859. It's building rocm-device-libs as part of llvm-amdgpu by configuring llvm with -DLLVM_EXTERNAL_PROJECTS=device-libs.

If you checkout that pr, run spack install hip@4.2.0, what you get is get is:

$ spack find -p llvm-amdgpu@4.2.0
llvm-amdgpu@4.2.0  /spack/opt/spack/linux-ubuntu20.04-zen2/gcc-10.2.0/llvm-amdgpu-4.2.0-a7jwajhh2cmn7p5djyx42lpcdfluk7wi

And indeed the bitcode is there:

$ find /spack/opt/spack/linux-ubuntu20.04-zen2/gcc-10.2.0/llvm-amdgpu-4.2.0-a7jwajhh2cmn7p5djyx42lpcdfluk7wi -iname '*.bc'
/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-10.2.0/llvm-amdgpu-4.2.0-a7jwajhh2cmn7p5djyx42lpcdfluk7wi/amdgcn/bitcode/oclc_isa_version_1033.bc
/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-10.2.0/llvm-amdgpu-4.2.0-a7jwajhh2cmn7p5djyx42lpcdfluk7wi/amdgcn/bitcode/ocml.bc
/spack/opt/spack/linux-ubuntu20.04-zen2/gcc-10.2.0/llvm-amdgpu-4.2.0-a7jwajhh2cmn7p5djyx42lpcdfluk7wi/amdgcn/bitcode/hip.bc
...

Now when I used this --print-rocm-search-dirs flag on clang without other flags, what I see is:

$ /spack/opt/spack/linux-ubuntu20.04-zen2/gcc-10.2.0/llvm-amdgpu-4.2.0-a7jwajhh2cmn7p5djyx42lpcdfluk7wi/bin/clang++ --print-rocm-search-dirs -x hip hi.cc 
ROCm installation search path (Spack 4.2.0): /spack/opt/spack/linux-ubuntu20.04-zen2/gcc-10.2.0
ROCm installation search path: /spack/opt/spack/linux-ubuntu20.04-zen2/gcc-10.2.0/llvm-amdgpu-4.2.0-a7jwajhh2cmn7p5djyx42lpcdfluk7wi/lib/clang/12.0.0
ROCm installation search path: /opt/rocm
...
clang-12: error: cannot find ROCm device library. Provide its path via --rocm-path or --rocm-device-lib-path, or pass -nogpulib to build without ROCm device library.

Now can you make it such that clang will search the llvm prefix itself?

Thu, May 27, 8:57 AM · Restricted Project
yaxunl added a comment to D102507: [HIP] Support <functional> in device code.

In effect this patch applies __host__ __device__ to a subset of the standard library headers and whatever headers *they* happen to include. While it may happen to work, I'm not at all confident that it does not create interesting issues.

Considering that the patch only works with libc++ anyways, perhaps it's time to make (parts) of libc++ itself usable from CUDA/HIP, instead of hacking around it in the wrappers?

@rsmith Richard, who would be the right person to discuss the standard library changes we may need?

Thu, May 27, 7:33 AM

Wed, May 26

yaxunl updated the diff for D103108: [CUDA][HIP] Promote const variables to constant.

fix test

Wed, May 26, 8:32 PM · Restricted Project
yaxunl requested review of D103221: [CUDA][HIP] Change default lang std to c++14.
Wed, May 26, 8:18 PM · Restricted Project
yaxunl added inline comments to D103108: [CUDA][HIP] Promote const variables to constant.
Wed, May 26, 6:49 PM · Restricted Project
yaxunl added inline comments to D103108: [CUDA][HIP] Promote const variables to constant.
Wed, May 26, 2:07 PM · Restricted Project
yaxunl updated the diff for D103108: [CUDA][HIP] Promote const variables to constant.

revised by Artem's comments

Wed, May 26, 2:05 PM · Restricted Project
yaxunl updated the diff for D103108: [CUDA][HIP] Promote const variables to constant.

do not promote or check dependent variables since their ctor/dtor/initializers are not determined yet

Wed, May 26, 7:36 AM · Restricted Project

Tue, May 25

yaxunl updated the diff for D102975: [HIP] Check compatibility of -fgpu-sanitize with offload arch.

revised by Artem's comments

Tue, May 25, 1:27 PM · Restricted Project
yaxunl requested review of D103108: [CUDA][HIP] Promote const variables to constant.
Tue, May 25, 11:42 AM · Restricted Project

Mon, May 24

yaxunl updated the summary of D101630: [HIP] Add --gpu-bundle-output.
Mon, May 24, 8:39 AM · Restricted Project
yaxunl updated the diff for D101630: [HIP] Add --gpu-bundle-output.

fixed option. bundle output if users specify output by -o or -E

Mon, May 24, 8:38 AM · Restricted Project
yaxunl added a comment to D101630: [HIP] Add --gpu-bundle-output.

How about this:
If the user explicitly specified --cuda-host-only or --cuda-device-only, then by default only allow producing the natural output format, unless a bundled output is requested by an option. This should keep existing users working.
If the compilation is done without explicitly requested sub-compilation(s), then bundle the output by default. This should keep the GPU-unaware tools like ccache happy as they would always get the single output they expect.

WDYT?

Mon, May 24, 8:37 AM · Restricted Project

Sun, May 23

yaxunl added a comment to D97340: [HIP] Support Spack packages.

Hi @tra and @yaxunl, I'm commenting as a reviewer of the spack pull request for the rocm 4.2.0 ecosystem. First of all: thanks for caring about spack installations, that's highly appreciated.

However, this patch does not seem the right approach to me, for a couple reasons:

  1. LLVM should not be aware of spack -- it doesn't make sense. Spack doesn't even have a stable 1.0 release, expect it to break and to be inconsistent between versions.
  2. It's incorrect in multiple ways: (a) the directory structure name is configurable in spack, not everybody has this <name>-<version>-<hash> structure, so you shouldn't rely on it, (b) you are still not guaranteed to pick up the correct llvm-amdgpu because it may be installed in a chained repo on a shared HPC system, and it won't be in the same prefix folder at all (c) spack's main selling point is it supports multiple installs of one package (a hash also changes for the same llvm version if a downstream dep such as zlib is updated), this patch just bails when it detect multiple installations

Let's step back a bit. The problem you seem to be solving is the circular dependency between llvm and rocm-device-libs which are separate packages in spack; clang can't know where rocm-device-libs is at compile time because rocm-device-libs depends on llvm.

Clearly the solution is to build llvm together with rocm-device-libs, as explained in the readme of https://github.com/RadeonOpenCompute/ROCm-Device-Libs, we can fix that in spack by adding a resource in the llvm package to pull in more sources, and LLVM can happily live without knowing about spack.

Thanks,

Harmen

Sun, May 23, 7:48 AM · Restricted Project

Sat, May 22

yaxunl requested review of D102975: [HIP] Check compatibility of -fgpu-sanitize with offload arch.
Sat, May 22, 9:13 PM · Restricted Project
yaxunl committed rGbf6124580dfb: [HIP] support ThinLTO (authored by yaxunl).
[HIP] support ThinLTO
Sat, May 22, 7:49 AM
yaxunl closed D99683: [HIP] Support ThinLTO.
Sat, May 22, 7:49 AM · Restricted Project, Restricted Project

Fri, May 21

yaxunl added inline comments to D102936: [CUDA] Work around compatibility issue with libstdc++ 11.1.0.
Fri, May 21, 3:18 PM · Restricted Project
yaxunl added inline comments to D100794: [HIP] Support overloaded math functions for hipRTC.
Fri, May 21, 2:47 PM · Restricted Project
yaxunl committed rG91dfd68e9015: [NFC][HIP] fix comments in __clang_hip_cmath.h (authored by yaxunl).
[NFC][HIP] fix comments in __clang_hip_cmath.h
Fri, May 21, 2:45 PM
yaxunl added a comment to D101793: [clang][AST] Improve AST Reader/Writer memory footprint.

Thanks for the approval!

Just want to understand the list of "decls to check for deferred diagnostics" better, where are these decls coming from? And why do they need to be checked for warnings? I see decls from libc are in the list, but I have no idea why are they selected.

For offloading languages e.g. OpenMP/CUDA/HIP, there are apparent errors in functions shared between host and device. However, unless these functions are sure to be emitted on device or host, these errors should not be emitted. These errors are so called deferred error messages. The function decls which need to be checked are recorded. After AST is finalized, the AST of these functions are iterated. If a function is found sure to be emitted, the deferred error message in it are emitted.

Thanks! So the DeclsToCheckForDeferredDiags contains the candidate decls to be checked. The decls are selected because they would generate diags in the context of offloading languages, but whether or not an diag will be emitted is deferred till traversal of the AST is performed. I still don't quite understand why would libc functions be in the candidate list? They look simple enough (and I think they've been there forever). For example __uint16_identity (https://code.woboq.org/userspace/glibc/bits/uintn-identity.h.html#32),

static inline __uint16_t __uint16_identity(__uint16_t __x) {
    return __x;
}

I don't see how this function would generate diag either on host or device.

Fri, May 21, 2:18 PM · Restricted Project

Thu, May 20

yaxunl added a comment to D101793: [clang][AST] Improve AST Reader/Writer memory footprint.

Thanks for the approval!

Just want to understand the list of "decls to check for deferred diagnostics" better, where are these decls coming from? And why do they need to be checked for warnings? I see decls from libc are in the list, but I have no idea why are they selected.

Thu, May 20, 7:02 PM · Restricted Project
yaxunl accepted D101793: [clang][AST] Improve AST Reader/Writer memory footprint.

LGTM. Thanks.

Thu, May 20, 3:16 PM · Restricted Project
yaxunl committed rG4cb42564ec4b: [CUDA][HIP] Fix device variables used by host (authored by yaxunl).
[CUDA][HIP] Fix device variables used by host
Thu, May 20, 2:05 PM
yaxunl closed D102801: [CUDA][HIP] Fix device variables used by host.
Thu, May 20, 2:05 PM · Restricted Project
yaxunl added a comment to D101793: [clang][AST] Improve AST Reader/Writer memory footprint.

Tried to make Sema::DeclsToCheckForDeferredDiags llvm::SmallSetVector. The heap RSS did drop significantly (from peak 100GB to 59GB) , but not as good as the current fix (peak 26GB), which makes ASTReader::DeclsToCheckForDeferredDiags llvm::SmallSetVector.

I think the reason is that the duplicated decls are read from multiple module file sources (ASTReader::ReadAST() -> ASTReader::ReadASTBlock()), then stored into ASTReader::DeclsToCheckForDeferredDiags, then goes into Sema::DeclsToCheckForDeferredDiags in ASTReader::ReadDeclsToCheckForDeferredDiags(). Doing dedup at the early stage when the decls were just read in ASTReader is more effective at reducing RSS.

What if you use SmallSetVector for both Sema::DeclsToCheckForDeferredDiags and ASTReader::DeclsToCheckForDeferredDiags? Does it cause extra memory usage compared to using it only for ASTReader::DeclsToCheckForDeferredDiags? Thanks.

There would be a slight increase in memory usage since SmallSetVector requires more memory than just SmallVector internally, but given the majority the RSS comes from duplicated decls, I don't think it's an issue by making both SmallSetVector. If you think it's better to change both, I'll update the diff.

Thu, May 20, 1:33 PM · Restricted Project
yaxunl added a comment to D101793: [clang][AST] Improve AST Reader/Writer memory footprint.

Tried to make Sema::DeclsToCheckForDeferredDiags llvm::SmallSetVector. The heap RSS did drop significantly (from peak 100GB to 59GB) , but not as good as the current fix (peak 26GB), which makes ASTReader::DeclsToCheckForDeferredDiags llvm::SmallSetVector.

I think the reason is that the duplicated decls are read from multiple module file sources (ASTReader::ReadAST() -> ASTReader::ReadASTBlock()), then stored into ASTReader::DeclsToCheckForDeferredDiags, then goes into Sema::DeclsToCheckForDeferredDiags in ASTReader::ReadDeclsToCheckForDeferredDiags(). Doing dedup at the early stage when the decls were just read in ASTReader is more effective at reducing RSS.

Thu, May 20, 1:17 PM · Restricted Project
yaxunl updated the diff for D102801: [CUDA][HIP] Fix device variables used by host.

revised by Artem's comments

Thu, May 20, 12:55 PM · Restricted Project
yaxunl added inline comments to D102801: [CUDA][HIP] Fix device variables used by host.
Thu, May 20, 12:54 PM · Restricted Project
yaxunl added a comment to D102801: [CUDA][HIP] Fix device variables used by host.

Tentative LGTM as we need it to fix the regression soon.

Summoning @rsmith for the 'big picture' opinion.
While the patch may fix this particular regression, I wonder if there's a better way to deal with this. We're growing a bit too many nuances that would be hard to explain and may cause more corner cases to appear.

Thu, May 20, 11:05 AM · Restricted Project
yaxunl added a comment to D102801: [CUDA][HIP] Fix device variables used by host.

This patch does not appear to fix the second regression introduced by the D102237.

Trying to compile the following code triggers an assertion in CGExpr.cpp:

class a {
public:
  a(char *);
};
void b() {
  [](char *c) {
    static a d(c);
    d;
  };
}

With assertions disabled it eventually leads to a different error:
Module has a nontrivial global ctor, which NVPTX does not support.
https://godbolt.org/z/sYE1dKr1W

Thu, May 20, 10:57 AM · Restricted Project
yaxunl updated the diff for D102801: [CUDA][HIP] Fix device variables used by host.

Fix the other regression

Thu, May 20, 10:46 AM · Restricted Project

Wed, May 19

yaxunl added a comment to D102237: [CUDA][HIP] Fix non-ODR-use of static device variable.

Here's a slightly simpler reproducer: https://godbolt.org/z/rW6P9e37s

Wed, May 19, 1:05 PM · Restricted Project
yaxunl requested review of D102801: [CUDA][HIP] Fix device variables used by host.
Wed, May 19, 12:55 PM · Restricted Project

Tue, May 18

yaxunl added a comment to D102237: [CUDA][HIP] Fix non-ODR-use of static device variable.

Sam, this patch has apparently triggered some unwanted side effects. I'm still reducing the failures to something that could be used for debugging, but the rough symptoms are:

We now end up emitting the code for the host-only static member functions of instantiated class templates during device-side compilation.
Clang now complains about not allowed nontrivial static initializers for some variables in class templates.

Tue, May 18, 2:33 PM · Restricted Project
yaxunl accepted D102723: [HIP] Tighten checks in hip-include-path.hip test case.

LGTM. Thanks.

Tue, May 18, 1:46 PM · Restricted Project

Mon, May 17

yaxunl committed rG18cb17ce4cd5: [HIP] Fix spack detection (authored by yaxunl).
[HIP] Fix spack detection
Mon, May 17, 10:24 AM
yaxunl closed D102556: [HIP] Fix spack detection.
Mon, May 17, 10:24 AM · Restricted Project
yaxunl added inline comments to D102556: [HIP] Fix spack detection.
Mon, May 17, 9:46 AM · Restricted Project

Sat, May 15

yaxunl requested review of D102556: [HIP] Fix spack detection.
Sat, May 15, 10:22 AM · Restricted Project

Fri, May 14

yaxunl added a comment to D101793: [clang][AST] Improve AST Reader/Writer memory footprint.

I think the root cause might be duplicated decls are added to Sema::DeclsToCheckForDeferredDiags defined in

Fri, May 14, 1:21 PM · Restricted Project
yaxunl added a comment to D102508: [HIP] Add test libstd_functional.

Looks good, let me verify on HIP builder.

Fri, May 14, 9:27 AM
yaxunl updated the summary of D102507: [HIP] Support <functional> in device code.
Fri, May 14, 9:22 AM
yaxunl requested review of D102508: [HIP] Add test libstd_functional.
Fri, May 14, 9:16 AM
yaxunl requested review of D102507: [HIP] Support <functional> in device code.
Fri, May 14, 9:10 AM

May 13 2021

yaxunl accepted D102427: [HIP] Clean up llvm intrinsics using __asm.

LGTM. Thanks.

May 13 2021, 11:42 AM · Restricted Project
yaxunl accepted D102403: [HIP] Add __builtin_amdgcn_groupstaticsize.

LGTM. Thanks.

May 13 2021, 8:36 AM · Restricted Project

May 12 2021

yaxunl committed rGce6cc87ce9e9: [clang] Minor fix for MarkVarDeclODRUsed (authored by yaxunl).
[clang] Minor fix for MarkVarDeclODRUsed
May 12 2021, 7:33 PM
yaxunl closed D102270: [CUDA][HIP] Fix device template variables.
May 12 2021, 2:14 PM
yaxunl updated the diff for D99683: [HIP] Support ThinLTO.

Revised by Teresa's comments

May 12 2021, 1:50 PM · Restricted Project, Restricted Project
yaxunl added inline comments to D99683: [HIP] Support ThinLTO.
May 12 2021, 1:43 PM · Restricted Project, Restricted Project
yaxunl committed rG98575708da95: [CUDA][HIP] Fix device template variables (authored by yaxunl).
[CUDA][HIP] Fix device template variables
May 12 2021, 8:14 AM
yaxunl closed D102237: [CUDA][HIP] Fix non-ODR-use of static device variable.
May 12 2021, 8:14 AM · Restricted Project
yaxunl added inline comments to D99683: [HIP] Support ThinLTO.
May 12 2021, 7:00 AM · Restricted Project, Restricted Project