Details
- Reviewers: ldionne
- Group Reviewers: Restricted Project
- Commits: rG6e5342a6b0f4: [libcxx] Move Linaro AArch64 buildbots to buildkite

Diff Detail
- Repository: rG LLVM Github Monorepo

Event Timeline
Turns out that yes, I can use tags other than "queue" for the agents, which answers my own question.
Thanks for moving the bots over! FWIW, either using two tags (as you do here) or using a descriptive queue name that includes the architecture is fine by me.
When you update this review, could you please rebase onto main? It will get some additional changes that should reduce the load on our macOS CI.
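For reference, a minimal sketch of the two options in the BuildKite pipeline file; the step labels, commands, and tag values below are illustrative, not the ones from this patch:

```yaml
steps:
  # Option 1: generic queue name plus a separate architecture tag.
  - label: "AArch64"
    command: "libcxx/utils/ci/run-buildbot aarch64"   # placeholder command
    agents:
      queue: "libcxx-builders"
      arch: "aarch64"

  # Option 2: a single descriptive queue name that encodes the architecture.
  - label: "AArch64 -fno-exceptions"
    command: "libcxx/utils/ci/run-buildbot aarch64-noexceptions"   # placeholder command
    agents:
      queue: "libcxx-builders-linaro-aarch64"
```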
This is looking pretty good. The failures you're seeing appear to be due to flaky tests; I just checked in the following to help:

commit 642048eea041ff79aa9e8a934edff2415ab16447
Author: Louis Dionne <ldionne.2@gmail.com>
Date:   Wed Feb 17 11:19:37 2021 -0500

    [libc++] Allow retries in a few more flaky tests
Can you please rebase on top of main and clean up the patch? I think we'll be good to go.
Also, what's your CI capacity like? From the build history, it looks like there's a single builder that takes roughly 1h30 per build. I expect that will be insufficient and will stall the CI queue. Is there any way you could get an additional builder or a faster one? Otherwise, if you were able to check the Dockerfile you use into the libcxx repository, we could look into using the capacity we have on GCE to run those jobs. Our GCE instances are huge and work like a charm.
We split the build bots over a few big machines, so timing sensitive tests are often an issue. Thanks for the patch.
> Also, what's your CI capacity like? From the build history, it looks like there's a single builder that takes roughly 1h30 per build. I expect that will be insufficient and will stall the CI queue. Is there any way you could get an additional builder or a faster one? Otherwise, if you were able to check the Dockerfile you use into the libcxx repository, we could look into using the capacity we have on GCE to run those jobs. Our GCE instances are huge and work like a charm.
It's actually worse than that, since I've only got one bot running at the moment, so the total is about 3 hours. I just wanted a baseline for how slow it would be if we treated it like the existing post-commit bot. I'm sure I can bring that down a lot.
Do you happen to know how much extra capacity moving to pre-commit required for other bots? I assume there are maybe 2–3x more pre-commit runs than post-commit.
(Side note: don't take the ~4hr runtime of our post-commit bots seriously. They're actually using make -j1, which I only discovered while doing this. Good thing they'll be obsolete soon anyway.)
Got it. One thing we can do to reduce contention is put your jobs after the - wait step in the BuildKite pipeline. That way, your jobs will only run if all the jobs above the wait succeeded. I use that to reduce the load on the macOS testers and it helps a lot.
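A rough sketch of that ordering, with placeholder labels and commands; in BuildKite, steps after a `wait` only run once every step before it has finished successfully:

```yaml
steps:
  - label: "Generic config (fast, runs first)"
    command: "libcxx/utils/ci/run-buildbot generic-cxx20"   # placeholder job
    agents:
      queue: "libcxx-builders"

  # Everything below runs only if all of the steps above succeeded.
  - wait

  - label: "AArch64 (gated behind the wait)"
    command: "libcxx/utils/ci/run-buildbot aarch64"          # placeholder job
    agents:
      queue: "libcxx-builders-linaro-aarch64"
```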
> Do you happen to know how much extra capacity moving to pre-commit required for other bots? I assume there are maybe 2–3x more pre-commit runs than post-commit.
To be honest, I don't know because the GCE instances that we use for all of our main jobs are so beefy they finish in a few minutes. In fact, they are scaled up/down automatically based on the number of jobs. To give you a general guideline based on what I've been seeing since the start of pre-commit CI, I think if you can have machines that run the tests in 30-45 minutes, just one or two of those should be sufficient since you only have two build jobs to dispatch. We can also add just one of the two jobs (say the one with exceptions enabled) with your current capacity and see how things go over the next few days/weeks. We'll adjust then.
Now with 2 agents, hopefully with enough cores for decent build times.
I'm thinking that I'll leave them after the "wait" until I'm confident we can deliver a decent turnaround time.
@ldionne Getting roughly half an hour for each config: https://buildkite.com/llvm-project/libcxx-ci/builds/1685#_. This is with 2 agents, one per config, so most of the time they'll run in parallel. OK to get this reviewed as-is and see how it goes?
In the meantime I'll do the prep to move the other 4 and remove the buildbot instances.
Excellent, this LGTM! Thanks a lot!
Do you have commit access?
Inline comment on libcxx/cmake/caches/AArch64.cmake, lines 3–4:
Non-blocking question: Is it possible to use something like --target instead? Could we set LIBCXX_TARGET_TRIPLE instead?
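For illustration, a hypothetical version of those lines using the variable named in the question; the triple value is just an example and would need to match the builder's toolchain:

```cmake
# Hypothetical sketch of the suggestion: set the triple libc++ should target
# directly via its own cache variable, rather than through compiler flags.
set(LIBCXX_TARGET_TRIPLE "aarch64-linux-gnu" CACHE STRING "")
```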
Yes, this looks perfect. Thanks a lot! You can disregard the back-deployment CI, it's been failing due to some artifacts not being available.