This is an archive of the discontinued LLVM Phabricator instance.

[libcxx] Move Linaro AArch64 buildbots to buildkite
ClosedPublic

Authored by DavidSpickett on Feb 8 2021, 7:49 AM.

Details

Reviewers
ldionne
Group Reviewers
Restricted Project
Commits
rG6e5342a6b0f4: [libcxx] Move Linaro AArch64 buildbots to buildkite

Event Timeline

DavidSpickett created this revision. Feb 8 2021, 7:49 AM
DavidSpickett requested review of this revision. Feb 8 2021, 7:49 AM
Herald added a project: Restricted Project. Feb 8 2021, 7:49 AM
Herald added a reviewer: Restricted Project.
DavidSpickett planned changes to this revision. Feb 8 2021, 7:49 AM

Turns out that yes I can use tags other than "queue" for the agents,
which answers my own question.

https://buildkite.com/docs/pipelines/command-step

Thanks for moving the bots over! FWIW, either using two tags (as you do here) or using a descriptive queue name that includes the architecture is fine by me.
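
As a rough illustration of the two options mentioned above, a Buildkite command step can select agents either through several tags or through a single descriptive queue name. This is only a sketch: the labels, commands, queue names, and the "arch" tag are hypothetical placeholders, not the values used in this patch.

```yaml
steps:
  # Option 1: match on two agent tags (a queue plus an extra, non-"queue" tag).
  - label: "AArch64 (two tags)"            # hypothetical label
    command: "echo 'build and test here'"  # placeholder command
    agents:
      queue: "example-linaro-queue"        # hypothetical queue name
      arch: "aarch64"                      # hypothetical custom tag

  # Option 2: encode the architecture in the queue name itself.
  - label: "AArch64 (descriptive queue)"   # hypothetical label
    command: "echo 'build and test here'"  # placeholder command
    agents:
      queue: "example-linaro-aarch64"      # hypothetical queue name
```

In either case a step is only dispatched to an agent that was started with matching tags, for example something like buildkite-agent start --tags "queue=example-linaro-queue,arch=aarch64" (tag values again hypothetical).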

When you update this review, could you please rebase onto main? It will get some additional changes that should reduce the load on our macOS CI.

DavidSpickett planned changes to this revision. Feb 16 2021, 3:18 AM

Rebase, testing connection.

Temporarily remove other builders to test AArch64 only.

DavidSpickett planned changes to this revision. Feb 16 2021, 6:01 AM

Misc edit to trigger another run.

DavidSpickett planned changes to this revision. Feb 16 2021, 6:16 AM

Another random edit to start a new build. This time running on the final hardware.

DavidSpickett planned changes to this revision. Feb 17 2021, 3:58 AM

Time with both configs.

DavidSpickett planned changes to this revision. Feb 17 2021, 4:18 AM

This is looking pretty good. The failures appear to be due to flaky tests; I just checked in the following to help:

commit 642048eea041ff79aa9e8a934edff2415ab16447
Author: Louis Dionne <ldionne.2@gmail.com>
Date:   Wed Feb 17 11:19:37 2021 -0500

    [libc++] Allow retries in a few more flaky tests

Can you please rebase on top of main and clean up the patch? I think we'll be good to go.

Also, what's your CI capacity like? From the build history, it looks like there's a single builder that takes roughly 1h30 per build. I expect that is going to be insufficient and that it will stall the CI queue. Is there any way you could get an additional builder or a faster one? Otherwise, if you were able to check in the Dockerfile you use into the libcxx repository, we could look into using the capacity we have on GCE to run those. Our GCE instances are huge and work like a charm.

We split the build bots over a few big machines, so timing sensitive tests are often an issue. Thanks for the patch.

It's actually worse than that since I've only got one bot running at the moment, so the total is 3hr. I just wanted a baseline for how slow it is if we just treated it like the existing post commit bot. I'm sure I can bring that down a lot.

Do you happen to know how much extra capacity moving to pre-commit required for other bots? I assume there's maybe 2/3x more pre-commit runs than post commit.

(Side note: Don't take the ~4hr runtime of our post-commit bots seriously. They're actually using make -j1, which I only discovered while doing this. Good thing they'll be obsolete soon anyway.)

It's actually worse than that since I've only got one bot running at the moment, so the total is 3hr. I just wanted a baseline for how slow it is if we just treated it like the existing post commit bot. I'm sure I can bring that down a lot.

Got it. One thing we can do to reduce contention is put your jobs after the "- wait" step in the Buildkite pipeline. That way, your jobs will only run if all the jobs above the wait succeeded. I use that to reduce the load on the macOS testers and it helps a lot.
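
A minimal sketch of that arrangement, assuming a pipeline with one fast job and one slow AArch64 job. The labels, commands, and queue name are hypothetical placeholders rather than the contents of the real libcxx pipeline file.

```yaml
steps:
  # Jobs above the wait run first, in parallel where agents allow.
  - label: "Main job"                  # hypothetical fast job
    command: "echo 'fast job'"         # placeholder command

  # Steps below this line only start once every step above has passed.
  - wait

  - label: "AArch64"                   # hypothetical slow job
    command: "echo 'slow job'"         # placeholder command
    agents:
      queue: "example-linaro-queue"    # hypothetical queue name
```

With this layout, a change that already fails an earlier job never reaches the scarcer AArch64 agents, which is where the reduction in load comes from.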

Do you happen to know how much extra capacity moving to pre-commit required for other bots? I assume there's maybe 2/3x more pre-commit runs than post commit.

To be honest, I don't know because the GCE instances that we use for all of our main jobs are so beefy they finish in a few minutes. In fact, they are scaled up/down automatically based on the number of jobs. To give you a general guideline based on what I've been seeing since the start of pre-commit CI, I think if you can have machines that run the tests in 30-45 minutes, just one or two of those should be sufficient since you only have two build jobs to dispatch. We can also add just one of the two jobs (say the one with exceptions enabled) with your current capacity and see how things go over the next few days/weeks. We'll adjust then.

Gentle ping :-). It would be awesome to be able to move off buildbot completely.

DavidSpickett updated this revision to Diff 327406. Mar 2 2021, 3:41 AM

Now with 2 agents, hopefully with enough cores for decent build times.

I'm thinking that I'll leave them after the "wait" until I'm confident we can deliver a decent turnaround time.

@ldionne Getting ~1/2hr for each config: https://buildkite.com/llvm-project/libcxx-ci/builds/1685#_. This is with 2 agents, one per config, so most of the time they'll run in parallel. OK to get this reviewed as is and see how it goes?

In the meantime I'll do the prep to move the other 4 and remove the buildbot instances.

ldionne accepted this revision. Mar 2 2021, 7:36 AM

Excellent, this LGTM! Thanks a lot!

Do you have commit access?

libcxx/cmake/caches/AArch64.cmake, lines 2–3

Non blocking question: Is it possible to use something like --target instead? Could we set LIBCXX_TARGET_TRIPLE instead?

This revision is now accepted and ready to land. Mar 2 2021, 7:36 AM

Set target triple instead of cpu.

Remove stray "ON" from string triple line.

DavidSpickett marked an inline comment as done. Mar 3 2021, 4:02 AM

Switched to target triple, still good to commit? (I have commit access)

ldionne accepted this revision. Mar 3 2021, 10:51 AM

Yes, this looks perfect. Thanks a lot! You can disregard the back-deployment CI, it's been failing due to some artifacts not being available.

This revision was landed with ongoing or failed builds. Mar 4 2021, 2:22 AM
This revision was automatically updated to reflect the committed changes.