This is an archive of the discontinued LLVM Phabricator instance.

[buildbot] Added config files for CUDA build bots
Needs Review · Public

Authored by tra on Jul 21 2020, 11:49 AM.

Details

Reviewers
kuhnel
gkistanova

Event Timeline

tra created this revision. Jul 21 2020, 11:49 AM

Why do you want to double the config files and scripts?
Why create another cluster and another node pool?
We can share these across our machines.

Also see my comment in D84256.

tra added a comment. Jul 22 2020, 11:13 AM

Why do you want to double the config files and scripts?

The terraform script can be merged, but...

Why create another cluster and another node pool?

I don't want to nuke an already-running MLIR cluster when I'm changing something in my setup. Many terraform operations result in 'tear down everything and create it from scratch'.
Also, MLIR's requirements are not exactly identical to my CUDA bot requirements. As you may have noticed, the VMs in the cudabot clusters have a notably different configuration from the MLIR ones.

We can share these across our machines.

I'm not sure about that. Both MLIR and CUDA need a GPU to work with, and GPUs are not shareable: if a machine already runs a pod which requested a GPU, no other GPU-requesting pod can be scheduled on it. So, in the end you will need the same number of VMs w/ GPUs. You could share the controller, but that's a negligible cost compared to everything else. In addition to that, the VM configuration for CUDA bots is tweaked for the CUDA buildbot workload. One of the pools has 24 cores, while the other two run with only 8 (and I may reduce that further). That arrangement may not be the right one for MLIR.

Granted, we could create a pool for each possible configuration we may need, but considering that the cluster itself may need to be torn down to reconfigure, I believe we're currently better off keeping the MLIR and CUDA clusters separate. We could keep the config in the same file, but given that the bots are substantially different, I don't see it buying us much, while it increases the risk of accidentally changing something in the wrong setup, as both run within the same GCP project.

In D84258#2167464, @tra wrote:

Why create another cluster and another node pool?

I don't want to nuke an already-running MLIR cluster when I'm changing something in my setup. Many terraform operations result in 'tear down everything and create it from scratch'.
Also, MLIR's requirements are not exactly identical to my CUDA bot requirements. As you may have noticed, the VMs in the cudabot clusters have a notably different configuration from the MLIR ones.

The buildbots get restarted every 24h anyway, so I suppose they can handle 1-2 more restarts. I also would not expect that many re-deployments of the cluster or of the node pools. At least for my setup this has become relatively stable. Minor changes can even be done on the fly. I only re-deployed the cluster today to move from 16 to 32 cores, and a change like that forces a re-deployment anyway.

We can share these across our machines.

I'm not sure about that. Both MLIR and CUDA need a GPU to work with, and GPUs are not shareable: if a machine already runs a pod which requested a GPU, no other GPU-requesting pod can be scheduled on it. So, in the end you will need the same number of VMs w/ GPUs. You could share the controller, but that's a negligible cost compared to everything else. In addition to that, the VM configuration for CUDA bots is tweaked for the CUDA buildbot workload. One of the pools has 24 cores, while the other two run with only 8 (and I may reduce that further). That arrangement may not be the right one for MLIR.

I would not run multiple containers on one VM. As you said, k8s cannot share one GPU across containers. I would rather create one "build slave" per VM (or a group of "build slaves", each in a separate VM) in buildbot and then have those VMs execute a set of "builders". We could have an m:n mapping of "build slaves" and "builders".

My mlir-nvidia builder is not very picky. It would probably run on any of your machines as long as it has an Nvidia card. Sorry about the non-inclusive wording here, but that's what buildbot calls them in the UI.
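
(For illustration only, a minimal sketch of what such an m:n mapping could look like in a buildbot master config. The worker names and the second builder name are invented, and a real config would also register the workers and attach the proper build factories.)

```python
from buildbot.plugins import steps, util

c = BuildmasterConfig = {}          # standard master.cfg boilerplate
c['builders'] = []

# Two GPU-equipped workers ("build slaves"), one per VM. Names are hypothetical.
gpu_workers = ['gpu-worker-1', 'gpu-worker-2']

# Placeholder factory; the real CUDA/MLIR factories would go here.
factory = util.BuildFactory()
factory.addStep(steps.ShellCommand(command=['ninja', 'check-all']))

# Each builder lists the same pool of workers; buildbot dispatches every
# build to whichever listed worker is idle, giving an m:n mapping of
# workers to builders.
for builder_name in ['mlir-nvidia', 'clang-cuda']:   # 'clang-cuda' is made up
    c['builders'].append(
        util.BuilderConfig(name=builder_name,
                           workernames=gpu_workers,
                           factory=factory))
```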

Granted, we could create a pool for each possible configuration we may need, but considering that the cluster itself may need to be torn down to reconfigure, I believe we're currently better off keeping the MLIR and CUDA clusters separate. We could keep the config in the same file, but given that the bots are substantially different, I don't see it buying us much, while it increases the risk of accidentally changing something in the wrong setup, as both run within the same GCP project.

But yes, having a tighter coupling would increase the number of conflicts from parallel edits. We would also somehow have to make sure we're not trying to deploy two different things in parallel...

tra added a comment. Jul 24 2020, 11:18 AM

I would not run multiple containers on one VM. As you said, k8s cannot share one GPU across containers. I would rather create one "build slave" per VM (or a group of "build slaves", each in a separate VM) in buildbot and then have those VMs execute a set of "builders". We could have an m:n mapping of "build slaves" and "builders".

My mlir-nvidia builder is not very picky. It would probably run on any of your machines as long as it has an Nvidia card. Sorry about the non-inclusive wording here, but that's what buildbot calls them in the UI.

That may be doable.
At the moment, all CUDA bots do their own build of test-suite tests. I'm planning to figure out how to build them once and get the GPU-enabled machines to only do tests.
GPU tests are relatively fast compared to building them, so there will be plenty of time to share.
If MLIR can also split build from test, then sharing a GPU-enabled VM among multiple builders makes sense.
However, as long as each builder is expected to compile something substantial, sharing will be at the expense of higher latency for the bot results -- one of the issues we want to fix here.
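
(A rough sketch, not part of this patch, of how such a build/test split could be wired up in buildbot: a compile-only builder on a CPU-only worker triggers a test-only builder on a shared GPU worker. All worker, builder, and scheduler names and the commands are invented for illustration, and getting the built artifacts onto the GPU machine is left out.)

```python
from buildbot.plugins import schedulers, steps, util

c = BuildmasterConfig = {}
c['builders'], c['schedulers'] = [], []

# Compile-only builder: runs on a big CPU-only worker, no GPU required.
build_factory = util.BuildFactory()
build_factory.addStep(steps.ShellCommand(command=['ninja', 'cuda-tests']))
# Once the test binaries are built, kick off the GPU-side builder.
build_factory.addStep(steps.Trigger(schedulerNames=['run-gpu-tests'],
                                    waitForFinish=True))

# Test-only builder: runs on the shared GPU worker and only executes tests.
test_factory = util.BuildFactory()
test_factory.addStep(steps.ShellCommand(command=['ninja', 'check']))

c['schedulers'].append(
    schedulers.Triggerable(name='run-gpu-tests',
                           builderNames=['cuda-test-runner']))
c['builders'] += [
    util.BuilderConfig(name='cuda-builder',
                       workernames=['cpu-worker-24core'],
                       factory=build_factory),
    util.BuilderConfig(name='cuda-test-runner',
                       workernames=['gpu-worker-1'],
                       factory=test_factory),
]
```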

Granted, we could create a pool for each possible configuration we may need, but considering that the cluster itself may need to be torn down to reconfigure, I believe we're currently better off keeping the MLIR and CUDA clusters separate. We could keep the config in the same file, but given that the bots are substantially different, I don't see it buying us much, while it increases the risk of accidentally changing something in the wrong setup, as both run within the same GCP project.

But yes, having a tighter coupling would increase the number of conflicts from parallel edits. We would also somehow have to make sure we're not trying to deploy two different things in parallel...

Interlocking two builders is relatively easy if we use annotated builder scripts. We can just add flock /builder/global-build-lock to the build scripts at strategic points.
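
(A sketch of the idea, assuming the lock path mentioned above and a Python-based annotated builder script; the helper name and the commented-out command are made up. The flock command-line tool in a shell step would work just as well.)

```python
import fcntl
import subprocess

# Lock file path taken from the comment above; it lives on the shared VM.
LOCK_PATH = '/builder/global-build-lock'

def run_exclusively(cmd):
    """Run cmd while holding the global build lock, so that two builders
    sharing the same GPU-equipped VM never hit the GPU at the same time."""
    with open(LOCK_PATH, 'w') as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)   # blocks until the lock is free
        try:
            subprocess.check_call(cmd)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)

# Hypothetical usage inside a build step:
# run_exclusively(['ninja', 'check-cuda'])
```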

Let's keep the clusters & pools separate for now. We'll revisit the issue once I evolve the setup to have separate build/test machines. Then we can consider consolidating things.

tra updated this revision to Diff 280545. Jul 24 2020, 12:11 PM

Updated directory structure.