Simple script for the HIP builder to build llvm-project incrementally, then build and execute HIP tests from llvm-test-suite.
Details
- Reviewers: tra, gkistanova, yaxunl
- Commits: rZORG1a32648b32ec: [zorg] Add HIP builder script

Diff Detail
- Repository: rZORG LLVM Github Zorg

Event Timeline
Please note, this patch is under development; I've added it here as an open uncommitted review to allow the HIP builder to use an external script.
https://reviews.llvm.org/D99894
Looks reasonable overall.
A few drive-by comments below on pitfalls you may eventually run into.
buildbot/amd/hip-build.sh

| Lines | Comment |
|---|---|
| 59–78 | Reusing the build directory will work most of the time, but it will likely have issues now and then. Based on my CUDA bots, I'd say a few times a year. Not a very big deal, but something to consider if you want the bot to just work all the time. Examples of potential sources of trouble: cmake reusing a cached configuration may sometimes fail because some of the cached items no longer reflect the build environment; I've also seen ninja getting stuck in an endless reconfiguration cycle, complaining about missing files, or endlessly rebuilding something because something went wrong with the dependencies it recorded during the previous build. On the old CUDA bot I eventually settled on cleaning the build directory + using ccache to speed up rebuilding. |
| 80–89 | This is fine for a proof-of-concept bot. However, this will be the bottleneck for your bot. Clang building and testing is already thoroughly covered by tons of other bots. Ideally we want to avoid doing a full build/test, and instead build/test once and then distribute the binaries to all the test-suite bots. On the CUDA bots I have one VM continuously building/testing LLVM and Clang, and a handful of machines that build and run the test-suite on different GPUs. This cuts the commit-to-test-results cycle down to just a few minutes, which allows pinpointing failures to a single commit. With a full build/test, which takes quite a bit longer, you often deal with multiple commits, which complicates figuring out the offending one. |
| 107–111 | You're building the tests for multiple GPUs. |
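The clean-directory-plus-ccache approach described above can be sketched roughly as follows. This is a hedged illustration, not the bot's actual configuration: the `BUILD_DIR`/`LLVM_SRC` variables and the cmake flags are assumptions.

```shell
# Hypothetical bot step: wipe the build directory each run and let ccache
# absorb most of the rebuild cost. BUILD_DIR and LLVM_SRC are placeholders.
rm -rf "${BUILD_DIR}"
mkdir -p "${BUILD_DIR}"
cmake -G Ninja -S "${LLVM_SRC}/llvm" -B "${BUILD_DIR}" \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_C_COMPILER_LAUNCHER=ccache \
  -DCMAKE_CXX_COMPILER_LAUNCHER=ccache
ninja -C "${BUILD_DIR}"
```

With a warm ccache, a clean reconfigure plus rebuild costs mostly link time, which sidesteps the stale-cache and broken-dependency failure modes of incremental reuse.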
Thank you for the comments; those are important issues we should address, if not now then later. I have a few questions on the comments below.
buildbot/amd/hip-build.sh

| Lines | Comment |
|---|---|
| 59–78 | That is a problem we want to fix, maybe not immediately. I will look into ccache in upcoming patches. How do your multiple test VMs deal with ccache when they are pulling the builder VM's binaries? |
| 80–89 | I'm wondering if it's okay to skip the check-clang step for this bot. I like your idea of having one VM perform that check and other bots reuse that build. Are you separating your bots into build-only (plus check-clang) and execution-only (on GPU)? How do you deal with commits you've found to cause regressions? |
| 107–111 | Thanks, we will probably start with one GPU and then add more machines with other GPUs. Do your buildbot machines have more than one GPU, and do you configure which one is used via CUDA_VISIBLE_DEVICES? |
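For reference, per-GPU device selection via CUDA_VISIBLE_DEVICES can be sketched like this. The `run_tests` function is a hypothetical stand-in for the real test invocation; only the environment-variable mechanism is the point.

```shell
#!/usr/bin/env bash

# Hypothetical stand-in for the real test step (e.g. running the test-suite).
run_tests() {
  echo "running tests on GPU ${CUDA_VISIBLE_DEVICES}"
}

# Expose one device at a time; each run sees a single GPU.
for gpu in 0 1; do
  CUDA_VISIBLE_DEVICES="${gpu}" run_tests
done
```

The CUDA runtime only enumerates the devices listed in CUDA_VISIBLE_DEVICES, so each iteration effectively pins the tests to one GPU.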
buildbot/amd/hip-build.sh

| Lines | Comment |
|---|---|
| 59–78 | Ccache is used to build LLVM only. Eventually we may want to test something more convoluted, like eigen3, cub, or thrust, and that will take much more time to build. When we get there, I want to push compilation of the tests to the build VM (or, possibly, to another VM, if building the tests takes too long). For now, building clang and building & running the tests are fairly well balanced -- the builds are fast enough to catch single LLVM/Clang commits most of the time, and the tests are fast enough to run on every build the builder makes. |
| 80–89 | At the moment the build VM produces clang binaries for the test VMs before it gets to run the tests on clang itself. I think this provides a minor net benefit over running CUDA tests only with tested-all-green clang binaries. IMO there's no harm in running test-suite CUDA tests with a clang that may have failed some of its own tests. In the worst case, we get a useless test failure which we can ignore until clang is fixed. On the positive side, we may get another failure which may be useful for understanding the original one. Again, my anecdotal experience is that this approach provides a much better signal/noise ratio for CUDA tests. Most of the test failures in clang have nothing to do with CUDA. I'm willing to tolerate an occasional easy-to-identify false positive as the price of not having to deal with frequent failures I don't care about. So far, the bots have done a pretty good job catching real issues and I've seen no false positives. |
| 107–111 | The bots run on Google Cloud now, so there's only one kind of GPU per bot. The previous incarnation of the CUDA buildbot ran on a local machine with multiple GPUs, and I indeed used to have a list of GPUs to run the tests on, one GPU at a time, controlled via CUDA_VISIBLE_DEVICES. |
I may be moving this patch and combining it with D99894, since we plan to add this to the public repo.
buildbot/amd/hip-build.sh

| Lines | Comment |
|---|---|
| 59–78 | That sounds good; we may also want to add larger builds later and would benefit from a build VM. |
| 80–89 | That is a good point; we will not require running HIP tests on all-green clang binaries either. At least those failures will be caught by other clang-specific testers and would be redundant here. I'm interested in getting this up to catch HIP issues soon. |
| 107–111 | It would be nice to also run our bots on the cloud. For now, we have a single Vega 20, although we build for various targets to catch build issues. Does Google Cloud have an option to test on AMD GPUs? |
buildbot/amd/hip-build.sh

| Lines | Comment |
|---|---|
| 107–111 | Unfortunately, no, AMD GPUs are not available on GCE. Using a single-GPU VM as the standard bot setup should work on multi-GPU machines, too. Just start one container per GPU, the same as you'd do if/when AMD GPUs become available in the cloud. |
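The one-container-per-GPU setup might look roughly like this. All specifics here are assumptions for illustration: the image name is made up, and the exact device-passing flags depend on the container runtime and GPU vendor (ROCm containers, for instance, also need the kernel device nodes mapped in).

```shell
# Hypothetical: launch one bot container per GPU on a multi-GPU host.
# "my-hip-bot-image" is a placeholder image name.
for gpu in 0 1 2 3; do
  docker run -d \
    --name "hip-bot-gpu${gpu}" \
    --env HIP_VISIBLE_DEVICES="${gpu}" \
    my-hip-bot-image
done
```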
Moving this script to the annotated/ directory. AnnotatedBuilder.py now supports running bash scripts.
Also, keeping this as a local uncommitted script during bring-up/staging.
General style nit: the script's variable quoting is very inconsistent. Variables are quoted in some places but not others.
LGTM otherwise.
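To illustrate the quoting nit, a minimal sketch (the variable names are hypothetical, not taken from the script): quoting every expansion as `"${...}"` keeps paths with spaces or empty values from being word-split or glob-expanded.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical variables -- the point is the consistent "${...}" quoting
# everywhere a variable is expanded.
BUILD_DIR="${WORKSPACE:-/tmp}/hip build"   # the space would break unquoted uses
mkdir -p "${BUILD_DIR}"                    # quoted: one argument, not two
cd "${BUILD_DIR}"                          # quoted: cd succeeds despite the space
echo "Building in ${BUILD_DIR}"
```

Unquoted, `mkdir -p ${BUILD_DIR}` would create two directories (`/tmp/hip` and `build`) instead of one.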
zorg/buildbot/builders/annotated/hip-build.sh

| Line | Diff | Comment |
|---|---|---|
| 66 ↗ | (On Diff #339640) | Including -DLLVM_ENABLE_RUNTIMES="openmp" in this list should suffice to produce a toolchain that can run OpenMP offloading tests as well as the HIP ones. |
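A hedged sketch of what that configure step might look like; only the LLVM_ENABLE_RUNTIMES line reflects the suggestion above, and the remaining flags are placeholders rather than the script's actual options.

```shell
# Hypothetical configure invocation; only LLVM_ENABLE_RUNTIMES="openmp" is the
# suggestion from the review comment -- the other flags are placeholders.
cmake -G Ninja -S llvm -B build \
  -DLLVM_ENABLE_PROJECTS="clang;lld" \
  -DLLVM_ENABLE_RUNTIMES="openmp" \
  -DCMAKE_BUILD_TYPE=Release
```

Runtimes listed in LLVM_ENABLE_RUNTIMES are built with the just-built clang, which is what makes the resulting toolchain usable for offloading tests.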
zorg/buildbot/builders/annotated/hip-build.sh

| Line | Diff | Comment |
|---|---|---|
| 66 ↗ | (On Diff #339640) | This should be easy to do. Let me know when you have some OpenMP tests in the test-suite you'd like to run on NVIDIA GPUs and I'll try adding them to my CUDA bots. |
@tra, have you seen this exception on your CUDA buildbots before?
https://lab.llvm.org/staging/#/builders/152/builds/3
Sorry, I haven't seen it before and can't tell what exactly the bot is unhappy about.
It says unsupported operand type(s) for %: 'WithProperties', but it's not clear why.
Looks like it was unrelated to the script; it was a bug fixed in D101575. The simple HIP buildbot is now operating in staging/silent mode. Thanks for all the reviews.