This is an archive of the discontinued LLVM Phabricator instance.

[buildbot] Added config files for CUDA build bots
Needs Review · Public

Authored by tra on Jul 21 2020, 11:49 AM.

Details

Reviewers
kuhnel
gkistanova

Event Timeline

tra created this revision. Jul 21 2020, 11:49 AM

Why do you want to double the config files and scripts?
Why create another cluster and another node pool?
We can share these across our machines.

Also see my comment in D84256.

tra added a comment. Jul 22 2020, 11:13 AM

Why do you want to double the config files and scripts?

The terraform script can be merged, but...

Why create another cluster and another node pool?

I don't want to nuke an already-running MLIR cluster when I'm changing something in my setup. Many terraform operations result in 'tear down everything and create it from scratch'.
Also, MLIR's requirements are not exactly identical to my CUDA bot requirements. As you may have noticed, the VMs in the cudabot clusters have a notably different configuration from the MLIR ones.

We can share these across our machines.

I'm not sure about that. Both MLIR and CUDA need a GPU to work with, and GPUs are not shareable: if a machine already runs a pod which requested a GPU, no other GPU-requesting pod can be scheduled on it. So, in the end you will need the same number of VMs w/ GPUs. You could share the controller, but that's a negligible cost compared to everything else. In addition to that, the VM configuration for CUDA bots is tweaked for the CUDA buildbot workload. One of the pools has 24 cores, while the other two run with only 8 (and I may reduce that further). That arrangement may not be the right one for MLIR.

Granted, we could create a pool for each possible configuration we may need, but considering that the cluster itself may need to be torn down to reconfigure, I believe we're currently better off keeping the MLIR and CUDA clusters separate. We could keep the config in the same file, but given that the bots are substantially different, I don't see it buying us much, while it increases the risk of accidentally changing something in the wrong setup, as both run within the same GCP project.

In D84258#2167464, @tra wrote:

Why create another cluster and another node pool?

I don't want to nuke an already-running MLIR cluster when I'm changing something in my setup. Many terraform operations result in 'tear down everything and create it from scratch'.
Also, MLIR's requirements are not exactly identical to my CUDA bot requirements. As you may have noticed, the VMs in the cudabot clusters have a notably different configuration from the MLIR ones.

The buildbots get restarted every 24h anyway, so I suppose they can handle 1-2 more restarts. I also would not expect that many re-deployments of the cluster or of the node pools. At least for my setup this has become relatively stable. Minor changes can even be done on the fly. I only re-deployed the cluster today to move from 16 to 32 cores, and a change like that forces a re-deployment anyway.

We can share these across our machines.

I'm not sure about that. Both MLIR and CUDA need a GPU to work with, and GPUs are not shareable: if a machine already runs a pod which requested a GPU, no other GPU-requesting pod can be scheduled on it. So, in the end you will need the same number of VMs w/ GPUs. You could share the controller, but that's a negligible cost compared to everything else. In addition to that, the VM configuration for CUDA bots is tweaked for the CUDA buildbot workload. One of the pools has 24 cores, while the other two run with only 8 (and I may reduce that further). That arrangement may not be the right one for MLIR.

I would not run multiple containers on one VM. As you said, k8s cannot share one GPU across containers. I would rather create one "build slave" per VM (or a group of "build slaves", each in a separate VM) in buildbot and then have those VMs execute a set of "builders". We could have an m:n mapping of "build slaves" and "builders".

My mlir-nvidia builder is not very picky. It would probably run on any of your machines as long as it has an Nvidia card. Sorry about the non-inclusive wording here, but that's what buildbot calls them in the UI.
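
(For illustration only, a minimal sketch of what such an m:n mapping could look like in a buildbot master config. The worker names and the second builder name are invented, and a real config would also register the workers and attach the proper build factories.)

```python
from buildbot.plugins import steps, util

c = BuildmasterConfig = {}          # standard master.cfg boilerplate
c['builders'] = []

# Two GPU-equipped workers ("build slaves"), one per VM. Names are hypothetical.
gpu_workers = ['gpu-worker-1', 'gpu-worker-2']

# Placeholder factory; the real CUDA/MLIR factories would go here.
factory = util.BuildFactory()
factory.addStep(steps.ShellCommand(command=['ninja', 'check-all']))

# Each builder lists the same pool of workers; buildbot dispatches every
# build to whichever listed worker is idle, giving an m:n mapping of
# workers to builders.
for builder_name in ['mlir-nvidia', 'clang-cuda']:   # 'clang-cuda' is made up
    c['builders'].append(
        util.BuilderConfig(name=builder_name,
                           workernames=gpu_workers,
                           factory=factory))
```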

Granted, we could create a pool for each possible configuration we may need, but considering that the cluster itself may need to be torn down to reconfigure, I believe we're currently better off keeping the MLIR and CUDA clusters separate. We could keep the config in the same file, but given that the bots are substantially different, I don't see it buying us much, while it increases the risk of accidentally changing something in the wrong setup, as both run within the same GCP project.

But yes, having a tighter coupling would increase the number of conflicts from parallel edits. We would also somehow have to make sure we're not trying to deploy two different things in parallel...

tra added a comment. Jul 24 2020, 11:18 AM

I would not run multiple containers on one VM. As you said, k8s cannot share one GPU across containers. I would rather create one "build slave" per VM (or a group of "build slaves", each in a separate VM) in buildbot and then have those VMs execute a set of "builders". We could have an m:n mapping of "build slaves" and "builders".

My mlir-nvidia builder is not very picky. It would probably run on any of your machines as long as it has an Nvidia card. Sorry about the non-inclusive wording here, but that's what buildbot calls them in the UI.

That may be doable.
At the moment, all CUDA bots do their own build of test-suite tests. I'm planning to figure out how to build them once and get the GPU-enabled machines to only do tests.
GPU tests are relatively fast compared to building them, so there will be plenty of time to share.
If MLIR can also split build from test, then sharing a GPU-enabled VM among multiple builders makes sense.
However, as long as each builder is expected to compile something substantial, sharing will be at the expense of higher latency for the bot results -- one of the issues we want to fix here.
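
(A rough sketch, not part of this patch, of how such a build/test split could be wired up in buildbot: a compile-only builder on a CPU-only worker triggers a test-only builder on a shared GPU worker. All worker, builder, and scheduler names and the commands are invented for illustration, and getting the built artifacts onto the GPU machine is left out.)

```python
from buildbot.plugins import schedulers, steps, util

c = BuildmasterConfig = {}
c['builders'], c['schedulers'] = [], []

# Compile-only builder: runs on a big CPU-only worker, no GPU required.
build_factory = util.BuildFactory()
build_factory.addStep(steps.ShellCommand(command=['ninja', 'cuda-tests']))
# Once the test binaries are built, kick off the GPU-side builder.
build_factory.addStep(steps.Trigger(schedulerNames=['run-gpu-tests'],
                                    waitForFinish=True))

# Test-only builder: runs on the shared GPU worker and only executes tests.
test_factory = util.BuildFactory()
test_factory.addStep(steps.ShellCommand(command=['ninja', 'check']))

c['schedulers'].append(
    schedulers.Triggerable(name='run-gpu-tests',
                           builderNames=['cuda-test-runner']))
c['builders'] += [
    util.BuilderConfig(name='cuda-builder',
                       workernames=['cpu-worker-24core'],
                       factory=build_factory),
    util.BuilderConfig(name='cuda-test-runner',
                       workernames=['gpu-worker-1'],
                       factory=test_factory),
]
```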

Granted, we could create a pool for each possible configuration we may need, but considering that the cluster itself may need to be torn down to reconfigure, I believe we're currently better off keeping the MLIR and CUDA clusters separate. We could keep the config in the same file, but given that the bots are substantially different, I don't see it buying us much, while it increases the risk of accidentally changing something in the wrong setup, as both run within the same GCP project.

But yes, having a tighter coupling would increase the number of conflicts from parallel edits. We would also somehow have to make sure we're not trying to deploy two different things in parallel...

Interlocking two builders is relatively easy if we use annotated builder scripts. We can just add flock /builder/global-build-lock to the build scripts at strategic points.
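
(A sketch of the idea, assuming the lock path mentioned above and a Python-based annotated builder script; the helper name and the commented-out command are made up. The flock command-line tool in a shell step would work just as well.)

```python
import fcntl
import subprocess

# Lock file path taken from the comment above; it lives on the shared VM.
LOCK_PATH = '/builder/global-build-lock'

def run_exclusively(cmd):
    """Run cmd while holding the global build lock, so that two builders
    sharing the same GPU-equipped VM never hit the GPU at the same time."""
    with open(LOCK_PATH, 'w') as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)   # blocks until the lock is free
        try:
            subprocess.check_call(cmd)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)

# Hypothetical usage inside a build step:
# run_exclusively(['ninja', 'check-cuda'])
```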

Let's keep the clusters & pools separate for now. We'll revisit the issue once I evolve the setup to have separate build/test machines. Then we can consider consolidating things.

tra updated this revision to Diff 280545. Jul 24 2020, 12:11 PM

Updated directory structure.