Simple script for the HIP builder to build llvm-project incrementally, then build and execute HIP tests from llvm-test-suite.
Details
- Reviewers: tra, gkistanova, yaxunl
- Commits: rZORG1a32648b32ec: [zorg] Add HIP builder script

Diff Detail
- Repository: rZORG LLVM Github Zorg

Event Timeline
Please note, this patch is under development; I've added it here as an open uncommitted review to allow the HIP builder to use an external script.
https://reviews.llvm.org/D99894
Looks reasonable overall.
A few drive-by comments below on pitfalls you may eventually run into.
buildbot/amd/hip-build.sh

| Lines | Comment |
|---|---|
| 59–78 | Reusing the build directory will work most of the time, but it will likely have issues now and then. Based on my CUDA bots, I'd say a few times a year. Not a very big deal, but something to consider if you want the bot to just work all the time. Examples of potential sources of trouble: cmake reusing a cached configuration may sometimes fail because some of the cached items no longer reflect the build environment; I've also seen ninja getting stuck in an endless reconfiguration cycle, complaining about missing files, or endlessly rebuilding something because something went wrong with the dependencies it recorded during the previous build. On the old CUDA bot I eventually settled on cleaning the build directory + using ccache to speed up rebuilding. |
| 80–89 | This is fine for a proof-of-concept bot. However, this will be the bottleneck for your bot. Clang building and testing is already thoroughly covered by tons of other bots. Ideally we want to avoid doing a full build/test, and instead build/test once and then distribute the binaries to all the test-suite bots. On the CUDA bots I have one VM continuously building/testing LLVM and Clang, and a handful of machines that build and run the test-suite on different GPUs. This cuts the commit-to-test-results cycle down to just a few minutes, which allows pinpointing failures to a single commit. With a full build/test, which takes quite a bit longer, you often deal with multiple commits, which complicates figuring out the offending one. |
| 107–111 | You're building the tests for multiple GPUs. |
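The clean-directory-plus-ccache approach described above can be sketched roughly as follows. This is a hedged illustration, not the bot's actual configuration: the `BUILD_DIR`/`LLVM_SRC` variables and the cmake flags are assumptions.

```shell
# Hypothetical bot step: wipe the build directory each run and let ccache
# absorb most of the rebuild cost. BUILD_DIR and LLVM_SRC are placeholders.
rm -rf "${BUILD_DIR}"
mkdir -p "${BUILD_DIR}"
cmake -G Ninja -S "${LLVM_SRC}/llvm" -B "${BUILD_DIR}" \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_C_COMPILER_LAUNCHER=ccache \
  -DCMAKE_CXX_COMPILER_LAUNCHER=ccache
ninja -C "${BUILD_DIR}"
```

With a warm ccache, a clean reconfigure plus rebuild costs mostly link time, which sidesteps the stale-cache and broken-dependency failure modes of incremental reuse.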
Thank you for the comments; those are important issues we should address, if not now then later. I have a few questions on the comments below.
buildbot/amd/hip-build.sh

| Lines | Comment |
|---|---|
| 59–78 | That is a problem we want to fix, maybe not immediately. I will look into ccache in upcoming patches. How do your multiple test VMs deal with ccache when they are pulling the builder VM's binaries? |
| 80–89 | I'm wondering if it's okay to skip the check-clang step for this bot. I like your idea of having one VM perform that check and other bots reuse that build. Are you separating your bots into build-only (plus check-clang) and execution-only (on GPU)? How do you deal with commits you've found to cause regressions? |
| 107–111 | Thanks, we will probably start with one GPU and then add more machines with other GPUs. Do your buildbot machines have more than one GPU, and do you configure which one is used via CUDA_VISIBLE_DEVICES? |
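For reference, per-GPU device selection via CUDA_VISIBLE_DEVICES can be sketched like this. The `run_tests` function is a hypothetical stand-in for the real test invocation; only the environment-variable mechanism is the point.

```shell
#!/usr/bin/env bash

# Hypothetical stand-in for the real test step (e.g. running the test-suite).
run_tests() {
  echo "running tests on GPU ${CUDA_VISIBLE_DEVICES}"
}

# Expose one device at a time; each run sees a single GPU.
for gpu in 0 1; do
  CUDA_VISIBLE_DEVICES="${gpu}" run_tests
done
```

The CUDA runtime only enumerates the devices listed in CUDA_VISIBLE_DEVICES, so each iteration effectively pins the tests to one GPU.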
buildbot/amd/hip-build.sh

| Lines | Comment |
|---|---|
| 59–78 | Ccache is used to build LLVM only. Eventually we may want to test something more convoluted, like eigen3, cub, or thrust, and that will take much more time to build. When we get there, I want to push compilation of the tests to the build VM (or, possibly, to another VM, if building the tests takes too long). For now, building clang and building & running the tests are fairly well balanced -- the builds are fast enough to catch single LLVM/Clang commits most of the time, and the tests are fast enough to run on every build the builder makes. |
| 80–89 | At the moment the build VM produces clang binaries for the test VMs before it gets to run the tests on clang itself. I think this provides a minor net benefit over running CUDA tests only with tested-all-green clang binaries. IMO there's no harm in running test-suite CUDA tests with a clang that may have failed some of its own tests. In the worst case, we get a useless test failure which we can ignore until clang is fixed. On the positive side, we may get another failure which may be useful for understanding the original one. Again, my anecdotal experience is that this approach provides a much better signal/noise ratio for CUDA tests. Most of the test failures in clang have nothing to do with CUDA. I'm willing to tolerate an occasional easy-to-identify false positive as the price of not having to deal with frequent failures I don't care about. So far, the bots have done a pretty good job catching real issues and I've seen no false positives. |
| 107–111 | The bots run on Google Cloud now, so there's only one kind of GPU per bot. The previous incarnation of the CUDA buildbot ran on a local machine with multiple GPUs, and I indeed used to have a list of GPUs to run the tests on, one GPU at a time, controlled via CUDA_VISIBLE_DEVICES. |
I may be moving this patch and combining it with D99894, since we plan to add this to the public repo.
buildbot/amd/hip-build.sh

| Lines | Comment |
|---|---|
| 59–78 | That sounds good; we may also want to add larger builds later and would benefit from a build VM. |
| 80–89 | That is a good point; we will not require running HIP tests on all-green clang binaries either. At least those failures will be caught by other clang-specific testers and would be redundant here. I'm interested in getting this up to catch HIP issues soon. |
| 107–111 | It would be nice to also run our bots on the cloud. For now, we have a single Vega 20, although we build for various targets to catch build issues. Does Google Cloud have an option to test on AMD GPUs? |
buildbot/amd/hip-build.sh

| Lines | Comment |
|---|---|
| 107–111 | Unfortunately, no, AMD GPUs are not available on GCE. Using a single-GPU VM as the standard bot setup should work on multi-GPU machines, too. Just start one container per GPU, the same as you'd do if/when AMD GPUs become available in the cloud. |
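The one-container-per-GPU setup might look roughly like this. All specifics here are assumptions for illustration: the image name is made up, and the exact device-passing flags depend on the container runtime and GPU vendor (ROCm containers, for instance, also need the kernel device nodes mapped in).

```shell
# Hypothetical: launch one bot container per GPU on a multi-GPU host.
# "my-hip-bot-image" is a placeholder image name.
for gpu in 0 1 2 3; do
  docker run -d \
    --name "hip-bot-gpu${gpu}" \
    --env HIP_VISIBLE_DEVICES="${gpu}" \
    my-hip-bot-image
done
```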
Moving this script to the annotated/ directory. AnnotatedBuilder.py now supports running bash scripts.
Also, keeping this as a local uncommitted script during bring-up/staging.
General style nit: the script's variable quoting is very inconsistent. Variables are quoted in some places but not others.
LGTM otherwise.
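To illustrate the quoting nit, a minimal sketch (the variable names are hypothetical, not taken from the script): quoting every expansion as `"${...}"` keeps paths with spaces or empty values from being word-split or glob-expanded.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical variables -- the point is the consistent "${...}" quoting
# everywhere a variable is expanded.
BUILD_DIR="${WORKSPACE:-/tmp}/hip build"   # the space would break unquoted uses
mkdir -p "${BUILD_DIR}"                    # quoted: one argument, not two
cd "${BUILD_DIR}"                          # quoted: cd succeeds despite the space
echo "Building in ${BUILD_DIR}"
```

Unquoted, `mkdir -p ${BUILD_DIR}` would create two directories (`/tmp/hip` and `build`) instead of one.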
zorg/buildbot/builders/annotated/hip-build.sh

| Line | Diff | Comment |
|---|---|---|
| 66 ↗ | (On Diff #339640) | Including -DLLVM_ENABLE_RUNTIMES="openmp" in this list should suffice to produce a toolchain that can run OpenMP offloading tests as well as the HIP ones. |
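A hedged sketch of what that configure step might look like; only the LLVM_ENABLE_RUNTIMES line reflects the suggestion above, and the remaining flags are placeholders rather than the script's actual options.

```shell
# Hypothetical configure invocation; only LLVM_ENABLE_RUNTIMES="openmp" is the
# suggestion from the review comment -- the other flags are placeholders.
cmake -G Ninja -S llvm -B build \
  -DLLVM_ENABLE_PROJECTS="clang;lld" \
  -DLLVM_ENABLE_RUNTIMES="openmp" \
  -DCMAKE_BUILD_TYPE=Release
```

Runtimes listed in LLVM_ENABLE_RUNTIMES are built with the just-built clang, which is what makes the resulting toolchain usable for offloading tests.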
zorg/buildbot/builders/annotated/hip-build.sh

| Line | Diff | Comment |
|---|---|---|
| 66 ↗ | (On Diff #339640) | This should be easy to do. Let me know when you have some OpenMP tests in the test-suite you'd like to run on NVIDIA GPUs and I'll try adding them to my CUDA bots. |
@tra, have you seen this exception on your CUDA buildbots before?
https://lab.llvm.org/staging/#/builders/152/builds/3
Sorry, I haven't seen it before and can't tell what exactly the bot is unhappy about.
It says unsupported operand type(s) for %: 'WithProperties', but it's not clear why.
Looks like it was unrelated to the script; it was a bug fixed in D101575. The simple HIP buildbot is now operating in staging/silent mode. Thanks for all the reviews.