This is an archive of the discontinued LLVM Phabricator instance.

Add a best practice section on how to configure a fast builder
ClosedPublic

Authored by reames on Nov 20 2021, 12:59 PM.

Details

Summary

This is based on conversations with a couple of folks currently running buildbots. There's a couple pieces which didn't make it in, but this tries to cover the common themes.

Diff Detail

Event Timeline

reames created this revision.Nov 20 2021, 12:59 PM
reames requested review of this revision.Nov 20 2021, 12:59 PM
Herald added a project: Restricted Project.Nov 20 2021, 12:59 PM
mehdi_amini added inline comments.Nov 20 2021, 4:16 PM
llvm/docs/HowToAddABuilder.rst
214
235

For this to be effective, we're missing (I think) the ability to send notifications to specific email addresses for builders that are on staging.

Looks good to me overall! Thanks

Overall looks good, thanks for writing this!

I have a few comments inline, which are mostly from my own experience and from a while ago, so take them with a grain of salt.

llvm/docs/HowToAddABuilder.rst
194

It also makes it easier to identify what's broken before you delve into the logs. If the number of commits is low, the build is restricted to a part of the compiler, and other bots on the same target pass, then it stands to reason that the failure is in the intersection between { commit, target, feature }.

When having more than one commit in a build (in cases where it's too hard to make bots faster), this greatly reduces the work of bisecting and identifying the offending commit.

I have done this with the Arm bots in their early stages and it has helped me a lot.

198

RelWithDebInfo is perhaps even more helpful, because you test the optimisation pipeline, get smaller objects to link, and still, when a crash produces a stack trace, you can see directly from the logs where to begin looking.
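
For illustration, a builder configuration along those lines might look roughly like this (the flags are just an example, not wording from the patch under review):

```
# Illustrative fast-builder configuration using RelWithDebInfo.
cmake -G Ninja ../llvm \
  -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DLLVM_ENABLE_ASSERTIONS=ON \
  -DLLVM_ENABLE_PROJECTS="clang"
```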

202

Most importantly, LLD reduces memory consumption, sometimes by orders of magnitude.

On targets with little memory this is paramount to getting faster builds, and sometimes to finishing a build at all.

But even on targets with more memory, it speeds up the linking process considerably, too.
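
For reference, switching a builder to LLD is typically a one-flag change (illustrative; it assumes lld is installed and the host compiler accepts -fuse-ld=lld):

```
# Link with LLD instead of the default system linker.
cmake -G Ninja ../llvm -DLLVM_USE_LINKER=lld
```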

211

This is true; however, on the more limited targets (like our initial Cortex-A9 targets back in the day), even ccache wasn't enough to speed up our builds, mostly because autoconf+make wasn't smart enough, IIRC.

It's possible that CMake+ninja plays well enough with ccache to make incremental builds obsolete, but I have not done the proper testing to confirm that.
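
For what it's worth, enabling ccache under CMake+ninja is usually just (illustrative, assuming ccache is installed on the worker):

```
# Route compilations through ccache via the LLVM build system option...
cmake -G Ninja ../llvm -DLLVM_CCACHE_BUILD=ON
# ...or via the generic CMake launcher variables:
cmake -G Ninja ../llvm \
  -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache
```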

217

This is true for fast disks (HD, SSD or better). On targets that build on slower disk interfaces (MMC, SD, NFS), multiple parallel accesses to the cache storage can thrash performance to the point that it dominates the critical path.

In those cases, incremental builds are *much* faster because they never get triggered in the first place.

222

Absolutely! Unless you can control everything on all users' environments (toolchain, libs, OS, CMake config), you can't have a generic cache shared amongst builders.

It is possible to have a shared build cache, but the management of such a cache is non-trivial and not the goal of our buildbot infrastructure.

Of course, bot owners can, if they want, set up such a cache, but it's an exercise for the reader, not the general recommendation.

235

Not really. The staging buildmaster's sole purpose is to make sure we *don't* get notified, so the bot and target owners can work in the background before going public with a noisy infrastructure.

The only thing we should do in the staging master is to notify the bot owner, but even that is probably redundant, as the bots only remain in the staging master while they can't be on the production one, so bot owners will be actively looking at them, trying to make them green as soon as possible.

In that case, they will probably see the failure before the email hits their inbox.

What I mean is that it would be an anti-pattern to turn on notifications here, because bot owners may be encouraged to let their bots stay on the staging master for longer than they should, since they already have some level of notification.

This is bad because we'll end up with a tiered quality infrastructure, where some bots warn some people, while others warn more people, or no people. This will make it harder for a random developer in the community to know what they did or did not break, and the idea that "all bots should be green" collapses.

This revision was not accepted when it landed; it landed in state Needs Review.Nov 21 2021, 8:01 AM
This revision was automatically updated to reflect the committed changes.

I went ahead and landed this as is, but plan to do a second pass (with separate review/discussion where appropriate) to incorporate @rengolin's additional comments.

mehdi_amini added inline comments.Nov 21 2021, 11:30 AM
llvm/docs/HowToAddABuilder.rst
198

RelWithDebInfo seems a bit heavy to me: the objects get ~10x larger, IIRC.
If what you're after is better stack traces in case of crashes, then -gmlt (line tables only) gets you that without blowing up the disk size / link time.
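
As a sketch (assuming the builder's host compiler is clang, which understands -gmlt), that would look something like:

```
# Release build with line-table-only debug info for readable crash backtraces.
cmake -G Ninja ../llvm \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_C_FLAGS=-gmlt \
  -DCMAKE_CXX_FLAGS=-gmlt
```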

235

Not really.

It seems to me that you missed the point of this section: it explicitly talks about the possibility that "builders can run indefinitely on the staging buildmaster", as well as "The sponsoring organization simply has to take on the responsibility of all bisection and triage".

This is bad because we'll end up with a tiered quality infrastructure, where some bots warn some people, while others warn more people, or no people.

This already exists. There are many bots that are just not on buildbots, for example https://buildkite.com/llvm-project/llvm-main/builds?branch=main or https://buildkite.com/mlir/mlir-core/builds?branch=main
We could also mention Chromium bots, or other people that build downstream for their own reasons.

The only question is whether this "second tier" infrastructure should be hosted on our infrastructure (could be a staging bot, could be a silent mode on the production buildbot), or whether you'd just prefer that these folks stay on their own infra (which does not change much in practice: I'm reverting patches that break my buildkite bots...).

Finally, I have had bots on staging for a couple of months, and haven't been confident enough to migrate them to prod (busy with other stuff), and I don't spend much time monitoring them. In the absence of email on failures I can't build the confidence to migrate them, so they sit there for now... I would very much appreciate being notified on these bots instead of having to poll every day to figure out whether they are stable enough!

reames updated this revision to Diff 388764.Nov 21 2021, 12:39 PM

Note - original patch has landed. Reopening with revisited wording to address suggestions brought up in review, to avoid fragmenting the discussion. This seems like the least confusing option; let me know if you'd rather I do this as a separate review.

reames marked 7 inline comments as done.Nov 21 2021, 12:45 PM
reames added inline comments.
llvm/docs/HowToAddABuilder.rst
198

I think I managed to address this with the revised wording, let me know if further tweaking is warranted.

222

I don't understand your comment. The text you replied to was specific to a cache shared between the workers of a single builder; did you maybe think this was more broadly scoped?

235

The capability to have a bot which notifies only the maintainer has come up multiple times. I think we definitely need to document how to do that. I believe we already can, but checking my notes, I didn't write down how. I will pay attention to this in the next round of conversation and see if I can figure out how to do it with the current infrastructure.

I vaguely remember it being a mode on the main buildmaster, not staging, but will have to confirm.

rengolin added inline comments.Nov 22 2021, 10:05 AM
llvm/docs/HowToAddABuilder.rst
222

I did, but tbh, this can also hit the same builder. For example, when the CMake files change from one commit to the next, two bots on the same builder can end up building with different configurations.

235

It seems to me that you missed the point of this section: it explicitly talks about the possibility that "builders can run indefinitely on the staging buildmaster"

I did not.

Some bot owners decide to keep their bots on the staging master with the knowledge that no one will know when they break, and that's the cost they pay.

This already exists. There are many bots that are just not on buildbots, for example https://buildkite.com/llvm-project/llvm-main/builds?branch=main or https://buildkite.com/mlir/mlir-core/builds?branch=main. We could also mention Chromium bots, or other people that build downstream for their own reasons.

None of those bots are "official", so it's not the same thing. People can set up whatever they want wherever they can, but that doesn't mean we (the rest of the community) have to care about their infrastructure.

Some infrastructure has been accepted, for example the Green bots, but only after proving that they had a stable infrastructure to begin with, providing the same level of quality we get from the main bots.

The only question is whether this "second tier" infrastructure should be hosted on our infrastructure (could be a staging bot, could be a silent mode on the production buildbot), or whether you'd just prefer that these folks stay on their own infra (which does not change much in practice: I'm reverting patches that break my buildkite bots...).

I think you're mixing things up... Staging bots are silent bots, production bots are noisy bots. There isn't much else that differs between them. So a silent production bot makes no sense.

Reverting patches because an unofficial bot breaks, without any warning, seems wrong to me. If those bots are important to the community, I suggest you follow the same path as Apple did with the Green bots and get those bots sanctioned to act as production/staging, so that everyone can see and be notified of the stable breakages.

There should be no second tier, in the same way that there should be no "yellow" bots (i.e. known broken, in production). It's either green or red. Production red means a genuine failure, or the bot must move to the staging master.

We've been operating under that assumption for more than a decade, and I think we shouldn't relax the rules just because it's difficult to keep those promises.

Finally, I have had bots on staging for a couple of months, and haven't been confident enough to migrate them to prod (busy with other stuff), and I don't spend much time monitoring them. In the absence of email on failures I can't build the confidence to migrate them, so they sit there for now... I would very much appreciate being notified on these bots instead of having to poll every day to figure out whether they are stable enough!

This is what we were discussing: mailing only the bot owner. I think there's an option in buildbot to do that.

When I was managing the Arm bots, I created a webpage that queried the bots every 5 minutes and displayed a grid of green/red statuses and links to the failed builds. This was a way to have a clean interface with only the bots I cared about.

Most of them were in the production master, but I still wanted to know if people were acting on their emails. Some were on the staging master, and that was an ok way to check up on them, at the time.
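
(For anyone wanting to replicate that kind of status page today, a minimal sketch is to poll the buildmaster's REST API; the URL and the use of a numeric builder id are assumptions based on current buildbot, not something from this review.)

```
# Fetch the most recent completed build for one builder; the "results" field
# in the response is 0 for success and 2 for failure.
# Replace <builderid> with the builder's numeric id.
curl -s "https://lab.llvm.org/buildbot/api/v2/builders/<builderid>/builds?limit=1&order=-number&complete=true"
```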

@reames thanks for the changes, they look good to me.

Generally this seems like a good direction to go in, if it's feasible - but it seems like a pretty big step up in resource expectations compared to the past/current buildbots/workers, so I'm not sure how feasible it is to get those resources.

llvm/docs/HowToAddABuilder.rst
154–155

Do we have any builders that achieve this consistently (I wouldn't think so, given the resources required)? If it's not actually achievable/achieved generally, maybe it's worth rephrasing to something more in line with the practical reality?

If this document is more aspirational/trying to set a fairly new (albeit good, but perhaps not feasible?) direction - maybe it'd be more suitable in a different form/forum?

198

Split DWARF can reduce linker input size/time too, for what that's worth - but maybe enough in the noise/weeds/details to be omitted here.
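
For the record, the build system already has a switch for that (illustrative; it assumes a host toolchain and platform where split DWARF is supported):

```
# Emit debug info into .dwo files so the linker sees much smaller inputs.
cmake -G Ninja ../llvm -DLLVM_USE_SPLIT_DWARF=ON
```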

211

What smartness was autoconf+make lacking that made ccache ineffective in this situation? (I thought the point of ccache was that it basically didn't matter how the build system worked.)

212–216

Seems like we should figure out how to make incremental builds more reliable - to benefit developers (& then have buildbots using incremental builds to ensure they do keep working so developers can benefit from them being reliable). But, yeah, if it's just not practical today, so be it.

224–227

Is this about multiple workers on the same machine, or some kind of network shared cache? Presumably if we're suggesting people have multiple workers per builder (to get fast enough cycle time/short enough blame list) - that's multiple machines (since generally we could get enough parallelism to saturate a machine in the build - I guess not all the time, so maybe there's some parallelism benefit to multiple workers on the same machine?)

236–238

That's the default/what most (all?) of the buildbots are doing today, though, yeah?

rengolin added inline comments.Nov 22 2021, 10:56 AM
llvm/docs/HowToAddABuilder.rst
154–155

I don't think we have many, if any, but I interpreted it as a "preference" and "best practices", not that we don't accept others. I agree we shouldn't be discouraging people from setting up buildbots if they can't follow these guidelines.

211

I really can't remember, this was 8 years ago, but I vaguely remember that when I moved to CMake+ninja, the issues I had were gone.

Every time I set up ccache I hit some kind of problem that needed a full rebuild. I'm pretty sure that's just because I don't know how to set up ccache correctly, so mostly this should be filed under "maybe ccache isn't as trivial as hoped" or "Renato isn't smart (or patient) enough to set up ccache correctly". :)

212–216

For a number of years I used incremental builds on Arm with very little trouble. I had to clean the build directory perhaps a couple of times a year when something (that I don't remember) happened, but otherwise it was way better than full builds and ccache (due to using SSD or USB2 disks on dev boards).

224–227

I interpreted it as a network cache. I suppose it could be the same machine, too, though it would use the cache in a similar way if you use containers, for example.

Low-memory machines have to restrict link parallelism, so running two builds at the same time could still OOM-kill builds. High-memory machines (using LLD in release mode) have a linking phase fast enough that multiple builds tend not to help much. GCC builds used to be much less parallel than LLVM, so it worked well for them.
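
As an aside, the usual knob for that on low-memory machines is capping link parallelism (the value below is illustrative; pick one based on available RAM):

```
# Allow many compile jobs but only a couple of concurrent links.
cmake -G Ninja ../llvm -DLLVM_PARALLEL_LINK_JOBS=2
```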

For a while, for Arm64, we didn't have a lot of machines, so we put multiple (different) builders on the same machine, but that couldn't use the same cache anyway.

236–238

Yeah, I hadn't seen it that way, but I guess you're right.

Perhaps we should make clear that this is an aspiration, not a strong recommendation.

reames marked an inline comment as done.Nov 22 2021, 10:59 AM

Generally this seems like a good direction to go in, if it's feasible - but it seems like a pretty big step up in resource expectations compared to the past/current buildbots/workers, so I'm not sure how feasible it is to get those resources.

Er, no. This is very much documenting current status. While we have a bunch of batch builders, we also have a bunch which do build every commit.

See every builder with collapseRequests: false in https://github.com/llvm/llvm-zorg/blob/main/buildbot/osuosl/master/config/builders.py. Admittedly, this is only 12 out of 142 registered builders, but that doesn't count all the ones which keep up in practice and simply haven't committed to it.

llvm/docs/HowToAddABuilder.rst
154–155

As noted in the top level comment, we have a bunch of builders which do keep up building every commit.

This is aspirational, but only in the sense that a new builder which can't meet this bar has to explain why it's still worthwhile having as a notifying builder. We may accept it, but the burden of justification is definitely on the bot owner.

The main glide path I see - which we need better infrastructure for - is allowing "small" (2-3) commit batches as a graceful fallback when fully keeping up isn't practical.

198

If someone with experience can provide guidance here, I'm happy to write it up. I just haven't heard of successful deployments yet.

212–216

I agree with the goal, but this document's purpose is to provide best practice for current reality. If we get incremental builds working reliably, we can change the recommendation.

224–227

I don't know. The little bit of discussion I've had with existing bot owners is mostly around the problems introduced with multiple workers on the same machine. I don't currently have any guidelines to suggest on this balancing act and thus left it out.

If you have experience with this, let's talk offline.

236–238

See top level comments.

This revision was landed with ongoing or failed builds.Nov 22 2021, 11:02 AM
reames added inline comments.Nov 22 2021, 11:07 AM
llvm/docs/HowToAddABuilder.rst
154–155

Ok, let me walk my last comment back a bit.

As written, this is a best practice. It *does not* require any higher burden of justification for new bots, and there's nothing in the document which currently frames it that way. (Or at least, if there is, the wording is a mistake and I'll fix that.)

Longer term, I *do* think we should be rejecting build bots which can't demonstrate a worthwhile tradeoff between benefit to single config and community impact. However, that needs a lot of broader discussion first. IMO, we have several today which do not justify their existence.

mehdi_amini added inline comments.Nov 22 2021, 11:11 AM
llvm/docs/HowToAddABuilder.rst
235

I don't understand or agree with your opinion; I obviously have a very different view from you about the dynamics of how everything is set up.
But I'm not sure we have to argue about it: many people have bots like that, and they contribute significantly to the project by reverting patches that break something that isn't tested upstream for some reason.
I don't quite get why we should force people to make their bots noisy: they take on the maintenance for the benefit of everyone!

Reverting patches because an unofficial bot breaks, without any warning, seems wrong to me.

How so?
How different is it from a revert because of a miscompile we detect while testing a downstream project? For example, an LLVM commit to an optimization introduces a miscompile in the Rust compiler, or a miscompile with Clang when building Chromium: we consistently revert such cases as well.

rengolin added inline comments.Nov 22 2021, 11:30 AM
llvm/docs/HowToAddABuilder.rst
235

I don't quite get why we should force people to make their bots noisy: they take on the maintenance for the benefit of everyone!

That's not what I proposed, just that it's either noisy (everyone gets notified) or it's not (no one gets notified). Production / Staging.

Notifying the bot owner on staging seems a good thing to me, too.

How different is it from a revert because of a miscompile we detect while testing a downstream project?

Upstream bots can be more aggressive on reverts, even before notifying authors, because they build every LLVM commit and notify authors of breakages. They are granted that privilege by the community because their owners (should) have agreed to a minimum service level.

Staging/secondary bots may build every commit, but they're silent and the cost is on the admins, not the community. Those should work with authors before reverting their patches, regardless of what they build. If they want to be more proactive, they should be part of the "upstream" bundle (and pay the cost).

Downstream projects (chromium, rust, android, company internal, etc.) can delay rebasing if there's a bug that needs fixing, and work with authors to fix the issue upstream.

If the above split is muddled, that shouldn't be a reason to continue doing it, but a reason to stop doing it and move to production if you want the same behaviour. Otherwise, it pushes the effort cost onto the community, and that's not fair.

reames added inline comments.Nov 22 2021, 11:45 AM
llvm/docs/HowToAddABuilder.rst
235

I suspect this discussion should move to another forum - e.g. llvm-dev - as it's gotten a bit off topic here.

My personal take is that reverts within a day or two of commit *from any source* are perfectly fine, *provided* a reproducible test case is supplied. I think we even document that in the developer policy. To me, the major question here is around "when should we notify", not "when should we revert".

yeah, generally I'm OK with all of this - eventually figuring out ways to migrate existing builders that are way out-of-spec (hours of blame list, etc) out of blame email notifications, back to staging or the like. (& yeah, as @mehdi_amini said, seems totally suitable that staging bots should (I think even by default, really) email builder owners on failure (possibly on every failure, not only green to red transitions))

llvm/docs/HowToAddABuilder.rst
154–155

Fair enough.

I like the idea, I'm just not sure how feasible it is - given the number of buildbots we have that don't meet that bar today, and how much testing we would lose if we made it a requirement (versus how much better test coverage we would gain from folks who are able to provide the extra hardware but are unaware of the need, or unable to justify it without that sort of encouragement/pushback).

I guess what I might say is "we generally have a strong preference for builders which can build every commit as they come in" - I'm not sure that the community behavior/norms/buildbot configurations so far indicate that "we" have that "strong preference". I think it's a good thing to have but might be an overreach to claim we do have it without some broader consensus gathering?

But happy to step back and leave you to this/leave it as-is too.