This is an archive of the discontinued LLVM Phabricator instance.

Increase inline threshold multiplier to 11 in nvptx backend.
ClosedPublic

Authored by JackAKirk on Jan 20 2023, 9:14 AM.

Details

Summary

I used https://github.com/zjin-lcf/HeCBench (with nvcc usage swapped to clang++), which is an adaptation of the classic Rodinia benchmarks aimed at CUDA and SYCL programming models, to compare different values of the multiplier using both clang++ cuda and clang++ sycl nvptx backends.
I find that the value is currently too low in both cases. Qualitatively (and in most cases there is a very close quantitative agreement across both cases), the change in code execution time for a range of multiplier values from 5 to 1000 matches in both variations (CUDA clang++ vs SYCL (with cuda backend) using the intel/llvm clang++ compiler) of the HeCBench samples.
This value of 11 is optimal for clang++ cuda in all cases I've investigated. I have not found a single case where performance is degraded by this change of the value from 5 to 11. For one sample the sycl cuda backend preferred a higher value; however, we are happy to prioritize clang++ cuda, and we find that this value is close to ideal for both cases anyway.
It would be good to do some further investigation using clang++ openmp cuda offload. However, since I do not know of an appropriate set of benchmarks for that case, and since we are now getting complaints about register spills related to insufficient inlining on a weekly basis, we have decided to propose this change and potentially seek more input from someone with more expertise in the openmp case.
Incidentally, this value coincides with the value used for the amd-gcn backend. We have also been able to use the amd backend of the intel/llvm "dpc++" compiler to compare the inlining behavior of identical code when targeting amd (compared to nvptx). Unsurprisingly, the amd backend with a multiplier value of 11 was performing better (with regard to inlining) than the nvptx case using the value of 5. When the two backends use the same multiplier value, the inlining behaviors appear to align closely.

This also considerably improves the performance of at least one of the most popular HPC applications: NWCHEMX.
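For reference, the mechanics of the change are small: the NVPTX TTI exposes a per-target multiplier that the generic inliner applies to its base threshold. The sketch below is a plain-C++ mock of that hook, not the actual LLVM class; the base-threshold figure of 225 is LLVM's generic default and is used here only for illustration.

```cpp
#include <cassert>

// Sketch (not the real LLVM code) of the NVPTX TTI hook this patch changes.
// The generic inliner scales its base threshold by this per-target value.
struct NVPTXTTISketch {
  unsigned getInliningThresholdMultiplier() const { return 11; } // was 5
};

// With LLVM's default base threshold of 225, the effective NVPTX threshold
// grows from 225 * 5 = 1125 to 225 * 11 = 2475.
unsigned effectiveThreshold(unsigned Base, const NVPTXTTISketch &TTI) {
  return Base * TTI.getInliningThresholdMultiplier();
}
```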

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>

Diff Detail

Event Timeline

JackAKirk created this revision.Jan 20 2023, 9:14 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 20 2023, 9:14 AM
JackAKirk requested review of this revision.Jan 20 2023, 9:14 AM
arsenm added a subscriber: arsenm.Jan 20 2023, 9:16 AM
arsenm added inline comments.
llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h
93

Comment is out of date, and naming the value here is kind of pointless to begin with

ldrumm added a subscriber: ldrumm.Jan 20 2023, 9:16 AM
JackAKirk retitled this revision from Increase inline threshold multiplier to 11 in nvptx backend. I used https://github.com/zjin-lcf/HeCBench (with nvcc usage swapped to clang++), which is an adaptation of the classic Rodinia benchmarks aimed at CUDA and SYCL programming models, to... to Increase inline threshold multiplier to 11 in nvptx backend..Jan 20 2023, 9:22 AM
JackAKirk edited the summary of this revision. (Show Details)
JackAKirk added a subscriber: tra.

Did you study the compile-time impact of bumping the inlining threshold? While I would expect perf benefits in some cases from such a change, in my experience this threshold increase can also cause significant compile-time overhead in some cases.

tra added a comment.EditedJan 20 2023, 11:26 AM

Bumping inlining threshold is generally beneficial for GPU performance. Due to a quirk in PTX spec, we often have to create local copies of byval function arguments and that is very expensive performance wise, with the only way to mitigate the issue to force-inline the functions. It becomes particularly painful in modern lambda-heavy C++ code.
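A minimal host-C++ analogue of that byval cost (illustrative only; `Big` and both functions are invented names, and the real byval lowering happens in the NVPTX backend): a large aggregate passed by value is copied at the call boundary, and on NVPTX that copy lands in slow local memory unless the callee is inlined and the copy elided.

```cpp
#include <cassert>

// Host C++ analogue of the PTX byval quirk: a large aggregate passed by
// value is copied per call; inlining lets the optimizer elide the copy.
struct Big { int data[256]; };            // ~1 KiB aggregate

int sumByValue(Big b) {                   // copies the whole struct per call
  return b.data[0] + b.data[1];
}

inline int sumInlined(const Big &b) {     // inlined / by-ref: no copy
  return b.data[0] + b.data[1];
}
```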

That said, in my experience, there are relatively few cases where such a bump is really needed and in a lot of them explicitly force-inlining the culprit function was sufficient. While I see how the change is beneficial for performance, I'm on the fence whether that benefit is worth everyone paying the incremental compile increase cost.

This value of 11 is optimal for clang++ cuda for all cases I've investigated.

It's very likely that the subset is not representative. I'm sure it will help the benchmarks, but that's not what most of the people usually compile.
Measuring impact on a somewhat realistic workload would be very helpful for establishing that the trade-off is worth it.

I would suggest checking how the change affects compilation of these projects:

tra added a reviewer: tra.Jan 20 2023, 11:26 AM
JackAKirk added a comment.EditedJan 23 2023, 8:51 AM

Thanks for all the comments. We agree that we should do more testing of compile time/execution time for samples using some more larger projects. I am currently investigating tensorflow. Will post updates soon.

tra added a comment.Jan 23 2023, 11:05 AM

I would expect thrust to be close to the worst-case scenario, and it also has a pretty extensive set of tests to compile. If there's no major compile-time regression on thrust tests, I'll be fine with the patch.

I am currently investigating tensorflow. Will post updates soon.

Keep in mind that the GPU-related part of TF compilation is relatively small compared to the rest of the build, so results there may be noisy.
The other projects on the list have more GPU code to compile. They may also be much easier to compile than tensorflow.

You may also add https://github.com/NVIDIA/nccl to the list (It's also part of TF build).

OK, I've looked into building thrust with clang cuda. Ran into some issues: https://github.com/NVIDIA/thrust/issues/1853

tra added a comment.Jan 24 2023, 10:57 AM

OK, I've looked into building thrust with clang cuda. Ran into some issues: https://github.com/NVIDIA/thrust/issues/1853

Newer versions of thrust may have issues w/ clang. In the past it regularly needed portability fixes. An old thrust revision f5ea60fd3aa3828c0eb8991a54acdfbed6707bd7 should be buildable w/ clang, though the CMakeFiles there may be too old to support clang as the cuda compiler. If you run into too much trouble, just skip it.

cutlass and nccl may be in better shape. Sorry about being vague -- we do compile cutlass/nccl/thrust with clang, but not always a recent version and we're not relying on cmake to do it, so I can't say what's the state of the official build of those projects when it comes to using clang as a CUDA compiler.
For the purposes of this experiment a quick-and-dirty solution of configuring the build to use nvcc, capturing the commands run by the build, editing them to replace NVCC and NVCC-specific options with clang equivalents, and running those commands as a script may do the trick.

Do you know of any clear instructions to build any branch of cutlass, eigen, or thrust with clang cuda tip? Thrust is currently broken on clang cuda (https://github.com/NVIDIA/thrust/issues/1853#issuecomment-1401952879) and I haven't had any more luck trying to build the others.
Although I have apparently been able to build tensorflow with clang cuda.

Cheers.

So here is the situation:

Summary: we plan to raise the inlining threshold multiplier in intel/llvm to 11, and you may wish to do the same upstream:

I looked into building cutlass etc with clang_cuda but currently their cmake doesn't easily support clang_cuda.
You can now build thrust using their cmake with clang_cuda using: https://github.com/NVIDIA/thrust/issues/1853#issuecomment-1402287672
When building thrust with clang_cuda, and in all other applications I have built with clang_cuda or the dpcpp cuda backend, I find zero dependence of compile time on the threshold multiplier. I even tried building thrust with the threshold multiplier set to 1000 (instead of 5), and the compile time of thrust didn't change.
I also cannot find any effect of the multiplier on the execution time of the thrust tests (note that I did not examine every test well enough to be sure that there is zero effect with statistical significance, but if there was an effect either way it must have been very small). I don't think this is very surprising, considering that thrust marks things inline very frequently, and that clang_cuda appears to be generally less sensitive to the inlining threshold than dpc++ cuda. The one sample where clang_cuda had a very large performance improvement from raising the multiplier to 11 (there were some others in HeCBench with smaller clang_cuda execution time improvements) was https://github.com/zjin-lcf/HeCBench/tree/master/bonds-cuda. You can see that there are many levels of function calls that are not marked inline in this code.
Someone else on my team investigated NWCHEMX using dpcpp cuda and found a big improvement from raising the multiplier to 11. I've also investigated GROMACS compiled with clang_cuda and dpcpp cuda. I find that, again, clang_cuda shows no change in performance depending on the multiplier value (5 vs 11). (One kernel looked like it could be a couple of percent faster with a multiplier value of 11, but I didn't run it enough times for this to be statistically significant to many sigma.) However, dpcpp cuda was again much more performant when the value was raised to 11 (https://gitlab.com/gromacs/gromacs/-/issues/3928#note_1259844220).
Generally, then, the multiplier value of 5 appears to be usually good enough for clang_cuda, but from my testing a value of 11 is still better.

tra accepted this revision.Feb 1 2023, 10:48 AM

Do you know of any clear instructions to build any branch of cutlass, eigen, or thrust with clang cuda tip? Thrust is currently broken on clang cuda (https://github.com/NVIDIA/thrust/issues/1853#issuecomment-1401952879) and I haven't had any more luck trying to build the others.
Although I have apparently been able to build tensorflow with clang cuda.

Tensorflow is the only open-source build using clang to compile CUDA that I'm aware of. That build also includes NCCL, so you can sort of count that in, too.

Summary: we plan to raise the inlining threshold multiplier in intel/llvm to 11, and you may wish to do the same upstream:

Are you saying that you do not plan to land this patch upstream?

I looked into building cutlass etc with clang_cuda but currently their cmake doesn't easily support clang_cuda.

I'll try to get their CMake files improved to work with clang. The recent versions of cutlass *are* buildable with clang, with a few patches that should've been upstreamed by now.

You can now build thrust using their cmake with clang_cuda using: https://github.com/NVIDIA/thrust/issues/1853#issuecomment-1402287672

Good to know.

When building thrust with clang_cuda, and in all other applications I have built with clang_cuda or the dpcpp cuda backend, I find zero dependence of compile time on the threshold multiplier. I even tried building thrust with the threshold multiplier set to 1000 (instead of 5), and the compile time of thrust didn't change.

For what it's worth, the issues related to inlining/unrolling thresholds I've ran into in the past fall into these broad categories:

  • thresholds are too small. GPU code heavily relies on inlining and loop unrolling for performance. This is the most common scenario. Standard LLVM thresholds don't always work well on a GPU.
  • thresholds are too high. The kernels become too large and start spilling registers into memory, which is very expensive. This mostly affects loop unrolling. I think I only had to add noinline once.
  • Compilation slowdown due to IR size explosion. Needs a lot of inlining candidates just under the threshold. Tends to happen with template-heavy code. However, the compiler also tends to spend a lot of the compilation in the C++ front-end in such cases, so the optimization-time increase is not always prominent.
  • I think in all/most really problematic cases the root cause of the slowdown was some other optimization pass with superlinear complexity.

I also cannot find any effect of the multiplier on the execution time of the thrust tests

I'd say it's unsurprising. As I mentioned before, there are relatively few cases where it's really needed in practice.

However again dpcpp cuda was much more performant when the value was raised to 11 (https://gitlab.com/gromacs/gromacs/-/issues/3928#note_1259844220).

How does performance compare between dpcpp and clang?

Generally then the multiplier value of 5 appears to be usually good enough for clang_cuda, but from my testing a value of 11 is still better.

I think you did sufficient due diligence checks to show that 11 does not produce obvious negative side effects. Incremental IR growth and a compilation time increase are a reasonable trade-off for improving a known weakness we have in NVPTX. I think it's a net gain.

I would suggest landing the patch in upstream LLVM.

That said, as with any heuristic change, there's a chance that it may have unforeseen consequences. If/when that happens, we can easily roll it back and users should be able to override the threshold manually.

I'll keep an eye on whether it causes any issues on our builds.

This revision is now accepted and ready to land.Feb 1 2023, 10:48 AM

For many cases, this threshold bump may not increase inlining if the old threshold was already aggressively inlining everything. In such cases, no compile time impact will be observed. For your motivating testcase that needs this threshold bump, did you measure how this impacted the compile time?

JackAKirk added a comment.EditedFeb 3 2023, 3:52 AM

Summary: we plan to raise the inlining threshold multiplier in intel/llvm to 11, and you may wish to do the same upstream:

Are you saying that you do not plan to land this patch upstream?

No I think it would be ideal to land it upstream. It is just that we will make the change in intel/llvm regardless, because it is especially important for dpc++.

However again dpcpp cuda was much more performant when the value was raised to 11 (https://gitlab.com/gromacs/gromacs/-/issues/3928#note_1259844220).

How does performance compare between dpcpp and clang?

It seems to depend, but usually dpcpp cuda compiles more slowly than clang cuda for a few different reasons that I don't understand extremely well. One straightforward reason is that dpcpp always compiles for at least two targets (host and device). There is also the fact that, by design, dpc++ uses C++ on purpose, with quite a lot of templating, which can slow down compilation with respect to CUDA. There are quite a few potential improvements to dpcpp compilation time that I think people are actively working on.
For the HeCbench sample most affected by the inline threshold bump (for both clang_cuda and dpcpp):
https://github.com/zjin-lcf/HeCBench/tree/master/bonds-cuda (adjusted to be compiled with clang_cuda) is almost exactly four times faster than the corresponding dpcpp code : https://github.com/zjin-lcf/HeCBench/tree/master/bonds-sycl (for both 5 and 11 values) (the execution times is almost exactly the same for dpcpp cuda and clang_cuda however).
For clang_cuda the total build time (either real or (usr + sys)) of bonds-cuda was 35% slower when using 11 compared to 5, but the execution time was more than twice as fast at the value of 11. I found these values using four different compilations/executions (for both 5 and 11), and the standard error of the mean of these runs was tiny compared to the differences in times, such that I am confident of the 35% slowdown. I also ran the benchmarks on a completely separate occasion as a quick check for any long-time correlations (note that I had unique access to the cluster, so I did not expect any effects from other users' activity), and I observed the same times. For the corresponding dpcpp cuda code there was a corresponding observed slowdown in compilation time from 5 to 11, but it was smaller, at around 10%, which I think probably makes sense given the larger overhead when compiling with dpcpp.
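The confidence check described here (comparing the run-to-run standard error of the mean against the observed difference) can be sketched as follows; the helper names and timing values below are illustrative, not the actual measurements:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Mean and standard error of the mean (SEM) over repeated benchmark runs.
double mean(const std::vector<double> &xs) {
  double s = 0;
  for (double x : xs)
    s += x;
  return s / xs.size();
}

double sem(const std::vector<double> &xs) {
  double m = mean(xs), ss = 0;
  for (double x : xs)
    ss += (x - m) * (x - m);
  // Sample standard deviation divided by sqrt(n).
  return std::sqrt(ss / (xs.size() - 1)) / std::sqrt(double(xs.size()));
}

// A difference in mean build time is trusted when it dwarfs both SEMs.
bool differenceIsClear(const std::vector<double> &a,
                       const std::vector<double> &b) {
  return std::fabs(mean(a) - mean(b)) > 3 * (sem(a) + sem(b));
}
```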
For gromacs, from what I understand, a direct comparison in compile time between dpcpp cuda and clang_cuda is not exactly fair, since gromacs supports gpu offload in at least slightly different ways for each. Using make -j 128 I did two compilation runs using a multiplier value of 5 with clang_cuda and found that the real time was 3m20.335s and 3m31.028s respectively for these two runs. I imagine that the real time for many threads could be quite noisy, and the mean may be quite different from these values. The (usr + sys) timings for these runs were, to the nearest minute, 116 minutes and 114 minutes respectively. The corresponding times for two compilation runs of dpcpp cuda with a value of 5 were 2m1.756s and 1m58.822s (real), and 55 and 54 minutes (sys + usr) respectively.
Then, for dpcpp cuda, the corresponding compilation times using the 11 value were basically the same. However, for clang_cuda I appeared to observe a slowdown in real compilation time: the real time was 4m31.509s, but (usr + sys) was unchanged at 114 minutes. Note that using make -j 2 showed no change in compilation time for clang_cuda from 11 wrt 5 (actually a value of 11 was slightly faster, but I only made one measurement and this change is probably attributable to noise). I don't know, but I imagine that the distribution of compilation time when using 128 threads could be very complicated, so I'm not sure any real conclusions can be drawn from these gromacs measurements; I've just included them for clarity.

I would suggest landing the patch in upstream LLVM.

That said, as with any heuristic change, there's a chance that it may have unforeseen consequences. If/when that happens, we can easily roll it back and users should be able to override the threshold manually.

I'll keep an eye on whether it causes any issues on our builds.

I'd agree that from the available information, a multiplier value of 11 appears a bit better for clang_cuda too.

Thanks for all the other comments!

For many cases, this threshold bump may not increase inlining if the old threshold was already aggressively inlining everything. In such cases, no compile time impact will be observed. For your motivating testcase that needs this threshold bump, did you measure how this impacted the compile time?

Please see my above comment.

tra added a comment.Feb 3 2023, 10:03 AM

Thank you for the details.

So, on balance, it appears not to hurt the cases where the threshold increase does not have a material impact on compilation time, and in the relatively few cases where it does affect compilation time, it provides a sufficient performance improvement. I think it's worth it.

Ship it.

JackAKirk updated this revision to Diff 494686.Feb 3 2023, 10:27 AM

Updated multiplier threshold comment.

JackAKirk marked an inline comment as done.Feb 3 2023, 10:34 AM

OK thanks. I've updated the comment swapping 5->11, so I'm happy for it to be merged now.