Page MenuHomePhabricator

[zorg] Add HIP builder script
AcceptedPublic

Authored by ashi1 on Apr 7 2021, 11:53 AM.

Details

Summary

Simple script for HIP builder to build
llvm-project incrementally, then build and
execute HIP tests from llvm-test-suite.

Diff Detail

Event Timeline

ashi1 requested review of this revision.Apr 7 2021, 11:53 AM
ashi1 planned changes to this revision.
ashi1 created this revision.

Please note, this patch is under-development, and I've added it here, as open uncommitted review to allow HIP builder to use external script.
https://reviews.llvm.org/D99894

tra added a comment.Apr 7 2021, 1:50 PM

Looks reasonable overall.

Few drive-by comments below for the pitfalls you may eventually run into later.

buildbot/amd/hip-build.sh
59–78 ↗(On Diff #335889)

Reusing the build directly will work most of the time, but it will likely have issues now and then. Based on my CUDA bots, I'd say few times a year. Not a very big deal, but something to consider if you want the bot to just work all the time.

Examples of potential sources of trouble:
cmake reusing cached configuration may sometimes fail, because some of the cached items no longer reflect the build environment.
I've seen ninja getting stuck in endless reconfiguration cycle, complaining about missing files or endlessly rebuilding something because because something went wrong with the dependencies it recorded during the previous build.

On the old CUDA bot I've eventually settled on cleaning the build directory + using ccache to speed up rebuilding.

80–89 ↗(On Diff #335889)

This is fine for proof-of-concept bot.

However, this will be the bottleneck for your bot.

Clang building and testing is already thoroughly covered by tons of other bots. Ideally we want to avoid doing a full build/test, but rather build/test it once and then distribute the binaries to all the testsuite bots. On CUDA bots I have one VM continuously building/testing llvm and clang, and a handful machines that build and run testsuite on different GPUs. This cuts down on the commit-to-test-results cycle to just a few minutes which allows pinpointing failures to a single commit. With the full build/test which takes quite a bit longer, you often deal with multiple commits which complicates figuring out the offending one.

107–111 ↗(On Diff #335889)

You're building the test for multiple GPUs.
Do you plan to run the test executables on all GPU variants, or only on one?
If it runs one one GPU only, you may want to have a loop here to run the tests on each of the available GPUs

ashi1 added a comment.Apr 9 2021, 11:36 AM

Looks reasonable overall.

Few drive-by comments below for the pitfalls you may eventually run into later.

Thank you for the comments, those are important issues we should address now, if not later. I had a few questions for the comments below.

buildbot/amd/hip-build.sh
59–78 ↗(On Diff #335889)

That is a problem we want to fix, maybe not immediately. I will look into ccache in upcoming patches. How does your multiple test-VMs deal with ccache when they are pulling the builder-VM's binaries?

80–89 ↗(On Diff #335889)

I'm wondering if its okay to skip the check-clang step for this bot. I like your idea of having one VM performing that check, and other bots re-using that build. Are you separating your bots between build-only (plus check-clang) and execution-only (on GPU) ? How do you deal with commits you've found to cause regression?

107–111 ↗(On Diff #335889)

Thanks, we will probably start with one GPU, and start adding more machines with other GPUs. Do your buildbot machines have more than 1 GPU, and you configure which one gets run using CUDA_VISIBLES_DEVICES?

tra added inline comments.Apr 12 2021, 2:45 PM
buildbot/amd/hip-build.sh
59–78 ↗(On Diff #335889)

Ccache is used to build LLVM only.
Test binaries are configured and built with the fresh clang from scratch, without caching. Building the subset of tests that we're using right now does not take much time, so it's not a big deal.

Eventually we may want to test something more convoluted, like eigen3, cub or thrust and that will take much more time to build. When we get there, I want to push compilation of the tests to the build VM (or, possibly, to another VM, if building the tests takes too long. For now building clang and building&running the tests are fairly well balanced -- builds are fast enough to catch single LLVM/clang commits most of the time and the tests are fast enough to be done on every build the builder makes.

80–89 ↗(On Diff #335889)

At the moment the build VM produces clang binaries for the test VMs before it gets to run the tests on clang itself.

I think it provides a minor net benefit over running CUDA tests with the tested-all-green clang binaries. IMO there's no harm running test-suite CUDA tests with the clang, even if it may have failed some tests. In the worst case, we may have a useless test failure which we can ignore until clang is fixed. On a positive side, we may have another failure which may be useful for understanding the failure.

Again, my anecdotal experience is that this approach provides way better signa/noise ratio for CUDA tests. Most of the test failures in clang have nothing to do with CUDA. I'm willing to tolerate an occasional easy-to-identify false positive as the price of not having to deal with frequent failures I don't care about. So far, the bots did pretty good job catching the real issues and I've seen no false positives.

107–111 ↗(On Diff #335889)

The bots run on Google Cloud now, so there's only one kind of GPU per bot.

Previous incarnation of the CUDA buildbot ran on a local machine with multiple GPUs and I indeed used to have the list of GPUs to run the tests on, one GPU at a time, controlled via CUDA_VISIBLES_DEVICES.

ashi1 marked 3 inline comments as done.Apr 20 2021, 6:56 AM

I may be moving this patch and combining with D99894 , since we plan to add this to the public repo.

buildbot/amd/hip-build.sh
59–78 ↗(On Diff #335889)

That sounds good, we may also want to add larger builds later and may benefit from building VM.

80–89 ↗(On Diff #335889)

That is a good point, we will not require running HIP tests on all-green clang binaries either. At least those will be caught by other clang specific testers, and be redundant here. I'm interested in getting this up to catch those issues soon.

107–111 ↗(On Diff #335889)

It would be nice to also run our bots on the cloud. For now, we have a single vega20, although building for various targets to catch build issues. Does the Google Cloud have an options to test on AMD GPUs?

tra added inline comments.Apr 20 2021, 9:49 AM
buildbot/amd/hip-build.sh
107–111 ↗(On Diff #335889)

Unfortunately, no, AMD GPUs are not available on GCE.

Using a single-GPU VM as the standard bot setup should work on multi-GPU machines, too. Just start one container per GPU, same as you'd do if/when AMD GPUs are available in the cloud.

ashi1 updated this revision to Diff 339640.Apr 22 2021, 8:10 AM
ashi1 marked 3 inline comments as done.

Moving this script to annotated/ directory. The AnnotatedBuilder.py now support running bash scripts.

Also, keeping this as a local uncommitted script during bring-up/staging.

tra accepted this revision.Apr 22 2021, 9:54 AM

General style nit: the script has very inconsistent quoting for the variables. They are quoted in some places but not others.

LGTM otherwise.

This revision is now accepted and ready to land.Apr 22 2021, 9:54 AM
zorg/buildbot/builders/annotated/hip-build.sh
66

Including -DLLVM_ENABLE_RUNTIMES="openmp" in this list should suffice to produce a toolchain that can run openmp offloading tests as well as the hip ones

tra added inline comments.Apr 22 2021, 1:14 PM
zorg/buildbot/builders/annotated/hip-build.sh
66

This should be easy to do. Let me know when you have some OMP tests in the test-suite you'd like to run on NVIDIA GPUs and I'll try adding them to my CUDA bots.

ashi1 added a comment.Apr 29 2021, 2:19 PM

@tra have you seen this exception in your CUDA buildbots before?
https://lab.llvm.org/staging/#/builders/152/builds/3

tra added a comment.Apr 29 2021, 3:20 PM

@tra have you seen this exception in your CUDA buildbots before?
https://lab.llvm.org/staging/#/builders/152/builds/3

Sorry, I haven't seen it before and Can't tell what exactly is the bot unhappy about.
It says unsupported operand type(s) for %: 'WithProperties' , but it's not clear why.

@tra have you seen this exception in your CUDA buildbots before?
https://lab.llvm.org/staging/#/builders/152/builds/3

Sorry, I haven't seen it before and Can't tell what exactly is the bot unhappy about.
It says unsupported operand type(s) for %: 'WithProperties' , but it's not clear why.

Looks like it was unrelated to the script, it was a bug fixed in D101575. Now the simple HIP buildbot is operating in staging/silent mode. Thanks for all the reviews.