This is an archive of the discontinued LLVM Phabricator instance.

MemoryBuffer: Migrate to Expected/llvm::Error from ErrorOr/std::error_code
AcceptedPublic

Authored by dblaikie on Sep 7 2021, 12:49 AM.

Details

Summary

Specifically to improve diagnostic/error messages - including the
filename in any error of an API that takes a file name (so not APIs that
take raw MemoryBuffers - don't uset the MemoryBuffer's identifier as the
name, because sometimes that's not adequate (such as for MemoryBuffer's
referring to files inside an archive) - and not for APIs taking a raw
file descriptor either).

This was spotted as a defficiency in llvm-jitlink which was complaining
about a missing file but not reporting which file.

Rather than patching that one case of a missing filename up in the
caller as most other callers do to workaround this, I figured I'd try
fixing this at a lower level - which involved plumbing Expected/Error
through those APIs to be able to carry the file name - and then fixing
up lots of callers and lots of test cases for slight differences in
syntax (mostly made diagnostic messages more consistent across tools
about quoting versus not quoting the filename, etc).

One or two cases got a bit worse - like llvm-symbolizer's yaml output,
now the diagnostic itself includes the filename, in addition to the file
yaml field. This could probably be improved/fixed by special-casing
FileError in the diagnostic printing to pull out/separate the file name
in the diagnostic.

Otherwise it's I think just a lot of cleanup.

Oh, it also included changing FileError's convertToErrorCode to look
down through it's nested ErrorInfo for error codes - this makes it
possible to use FileError when a caller is trying to check for a
specific file related error (such as 'file not found') which was being
done in a couple of places (& without this fix would've created a
strange tension where APIs wanting to support such callers would have to
use StringError or ECError instead of FileError, which seemed
counterintuitive/inelegant/difficult to deal with - in terms of then
trying to have consistent syntax/handling for file errors, and to
support the possibility of splitting out filenames as mentioned above
for the symbolizer output).

The Windows/Path.inc is entirely untested - will have to test it out on
buildbots, unless someone wants to give it a go before I submit.

Diff Detail

Event Timeline

dblaikie created this revision.Sep 7 2021, 12:49 AM
Herald added a reviewer: andreadb. · View Herald Transcript
Herald added a reviewer: MaskRay. · View Herald Transcript
Herald added a reviewer: gkm. · View Herald Transcript
Herald added a project: Restricted Project. · View Herald Transcript
Herald added a reviewer: Restricted Project. · View Herald Transcript
dblaikie requested review of this revision.Sep 7 2021, 12:49 AM
Herald added projects: Restricted Project, Restricted Project, Restricted Project, Restricted Project. · View Herald TranscriptSep 7 2021, 12:49 AM
thopre added a comment.Sep 7 2021, 3:02 AM

Is there no way to split this patch further? It's going to be hard finding someone who can review something so big. If there's no way to split it in incremental changes, you could perhaps split per subsystem only for review and refer to this diff for CI as well as when landing.

Is there no way to split this patch further? It's going to be hard finding someone who can review something so big. If there's no way to split it in incremental changes, you could perhaps split per subsystem only for review and refer to this diff for CI as well as when landing.

The long migration path would be to do this one function at. time (I did a whole cluster of functions in MemoryBuffer for consistency - this does reduce total code changes somewhat, since some of those APIs are used in similar contexts (eg: branches of a conditional operator - so having them differ means more revisions to those call sites)) and probably introducing separate names for the Expected versions of the functions, migrating call sites incrementally, then doing a mechanical rename at the end of all that.

I don't think it's probably worth that level of granularity - it's a fairly mechanical patch as it is.

Mostly I sent this out as an FYI and to get feedback on the general direction - whether folks thought it was worth doing at all, and if they feel strongly it should be done another way (like the incremental ones above) - but I don't think it needs a /huge/ amount of scrutiny, review by separate code owners, etc. I'd generally be comfortable committing this as other cross-project cleanup/refactoring like function renaming, etc.

This seems like the right direction to me! Especially like the look-through-the-ErrorInfo change for FileError -- I hit this before and found it annoying.

IMO, it'd be valuable to split out the consequential functional changes:

  • Improvements/changes you made to FileError could be committed ahead of time
  • Other improvements you discussed to avoid regressions in (e.g.) llvm-symbolizer (seems potentially important?)

I agree the other changes are mostly mechanical and don't all need careful review-by-subcomponents.

That said, it looks very painful for downstream clients of the LLVM C++ API since it's structured as an all-or-nothing change. Especially for managing cherry-picks to long-lived stable branches. It's painful because clients will have code like this:

if (auto MaybeFile = MemoryBuffer::getFileOrSTDIN()) {
  // Do something with MaybeFile
}
// Else path doesn't care about the specific error code.

that will suddenly start crashing at runtime. I even wonder if there like that introduced in-tree by your current all-in-one patch, since I think our filesystem-error paths are often missing test coverage. (It'd be difficult to do a thorough audit.)

I thought through a potential staging that could mitigate the pain:

  1. Add using MemoryBufferCompat = MemoryBuffer and search/replace all static MemoryBuffer:: API calls over to MemoryBufferCompat::. No direct impact on downstreams.
  2. Change MemoryBuffer to use Error and Expected, leaving behind std::error_code and ErrorOr wrappers in a no-longer-just-a-typedef MemoryBufferCompat. Easy for downstreams to temporarily revert, and cherry-pick once they've finished adapting in the example of (1).
  3. Update all code to use the new APIs. Could be done all at once since it's mostly mechanical, as you said. Also an option to split up by component (e.g., maybe the llvm-symbolizer regression is okay, but it could be nice to get that reviewed separately / in isolation).
  4. Delete MemoryBufferCompat. Downstreams can temporarily revert while they follow the example of on (3).

(Given that (1) and (2) are easy to write, you already have tree state for (4), and (3) is easy to create from (4), then I think you could construct the above commit sequence without having to redo all the work... then if you wanted to split (3) up from there it'd be easy enough.)

(2) seems like the commit mostly likely to cause functional regressions, and it'd be isolated and easy to revert/reapply (downstream and/or upstream) without much churn.

WDYT?

MaskRay accepted this revision.EditedSep 7 2021, 2:25 PM

ninja clang on Windows works.

check-llvm-tools has a few REQUIRES: system-windows tests. They still passed.

This revision is now accepted and ready to land.Sep 7 2021, 2:25 PM
martong removed a subscriber: martong.Sep 8 2021, 3:00 AM

This seems like the right direction to me! Especially like the look-through-the-ErrorInfo change for FileError -- I hit this before and found it annoying.

Thanks for taking a look!

IMO, it'd be valuable to split out the consequential functional changes:

  • Improvements/changes you made to FileError could be committed ahead of time

Sure sure, can be committed and unit tested separately for sure.

  • Other improvements you discussed to avoid regressions in (e.g.) llvm-symbolizer (seems potentially important?)

I didn't think the yaml symbolizer output was that important - but turned out not to be super hard to fix, so I've done that. (were there other regressions I mentioned/should think about?)

I agree the other changes are mostly mechanical and don't all need careful review-by-subcomponents.

That said, it looks very painful for downstream clients of the LLVM C++ API since it's structured as an all-or-nothing change.

Yeah, certainly crossed my mind.

Especially for managing cherry-picks to long-lived stable branches. It's painful because clients will have code like this:

if (auto MaybeFile = MemoryBuffer::getFileOrSTDIN()) {
  // Do something with MaybeFile
}
// Else path doesn't care about the specific error code.

that will suddenly start crashing at runtime. I even wonder if there like that introduced in-tree by your current all-in-one patch, since I think our filesystem-error paths are often missing test coverage. (It'd be difficult to do a thorough audit.)

Yeah, that came up in a handful of cases that used 'auto' (without using auto it's a compile failure).

I thought through a potential staging that could mitigate the pain:

  1. Add using MemoryBufferCompat = MemoryBuffer and search/replace all static MemoryBuffer:: API calls over to MemoryBufferCompat::. No direct impact on downstreams.

Yeah, that's some of the extra churn I was thinking of/hoping to avoid - but yeah, it's probably worthwhile.

What's the benefit of having the extra step where everything's renamed twice? (if it's going to be a monolithic commit - as mentioned in (3)) Compared to doing the mass change while keeping the (1 & 2) pieces for backwards compatibility if needed?

  1. Change MemoryBuffer to use Error and Expected, leaving behind std::error_code and ErrorOr wrappers in a no-longer-just-a-typedef MemoryBufferCompat. Easy for downstreams to temporarily revert, and cherry-pick once they've finished adapting in the example of (1).
  2. Update all code to use the new APIs. Could be done all at once since it's mostly mechanical, as you said. Also an option to split up by component (e.g., maybe the llvm-symbolizer regression is okay, but it could be nice to get that reviewed separately / in isolation).
  3. Delete MemoryBufferCompat. Downstreams can temporarily revert while they follow the example of on (3).

(Given that (1) and (2) are easy to write, you already have tree state for (4), and (3) is easy to create from (4), then I think you could construct the above commit sequence without having to redo all the work... then if you wanted to split (3) up from there it'd be easy enough.)

(2) seems like the commit mostly likely to cause functional regressions, and it'd be isolated and easy to revert/reapply (downstream and/or upstream) without much churn.

(3) would be more likely to cause regression? Presumably (2) is really uninteresting because it adds a new API (re-adding MemoryBuffer, but unused since everything's using MemoryBufferCompat) without any usage, yeah?

Plausible, though a fair bit more churn - I'd probably be up for it, though.

  • Dave
dblaikie updated this revision to Diff 371462.Sep 8 2021, 3:40 PM

Address the llvm-symbolizer regression by special casing FileError and extracting the non-filename portion of the error when emitting json since the filename is conveyed separately

dblaikie updated this revision to Diff 371463.Sep 8 2021, 3:44 PM

Fix a couple of spurious letters/modifications to a couple of test files

dblaikie updated this revision to Diff 371469.Sep 8 2021, 4:18 PM

Rebase on top of pushed FileError API improvements

(were there other regressions I mentioned/should think about?)

I don't have specific concerns; I was just reading between the lines of your description...

  1. Add using MemoryBufferCompat = MemoryBuffer and search/replace all static MemoryBuffer:: API calls over to MemoryBufferCompat::. No direct impact on downstreams.

Yeah, that's some of the extra churn I was thinking of/hoping to avoid - but yeah, it's probably worthwhile.

What's the benefit of having the extra step where everything's renamed twice? (if it's going to be a monolithic commit - as mentioned in (3)) Compared to doing the mass change while keeping the (1 & 2) pieces for backwards compatibility if needed?

Benefits I was seeing of splitting (1-3) up are:

  • makes (2) easy for downstreams to integrate separately from (1) and (3) (see below for why (2) is interesting)
  • prevents any reverts of (3) on main from resulting in churn in downstream efforts to migrate in response to (1-2)
  • enables (3) to NOT be monolithic -- still debatable how useful it is, but if split up then individual pieces can run through buildbots separately (and be reverted separately)
  1. Change MemoryBuffer to use Error and Expected, leaving behind std::error_code and ErrorOr wrappers in a no-longer-just-a-typedef MemoryBufferCompat. Easy for downstreams to temporarily revert, and cherry-pick once they've finished adapting in the example of (1).
  2. Update all code to use the new APIs. Could be done all at once since it's mostly mechanical, as you said. Also an option to split up by component (e.g., maybe the llvm-symbolizer regression is okay, but it could be nice to get that reviewed separately / in isolation).
  3. Delete MemoryBufferCompat. Downstreams can temporarily revert while they follow the example of on (3).

(Given that (1) and (2) are easy to write, you already have tree state for (4), and (3) is easy to create from (4), then I think you could construct the above commit sequence without having to redo all the work... then if you wanted to split (3) up from there it'd be easy enough.)

(2) seems like the commit mostly likely to cause functional regressions, and it'd be isolated and easy to revert/reapply (downstream and/or upstream) without much churn.

(3) would be more likely to cause regression? Presumably (2) is really uninteresting because it adds a new API (re-adding MemoryBuffer, but unused since everything's using MemoryBufferCompat) without any usage, yeah?

(2) changes all downstream clients of MemoryBuffer APIs from std::error_code to Error

  • Mostly will cause build failures
  • Also runtime regressions, due to auto, etc.

The fix is to do a search/replace of MemoryBuffer:: to MemoryBufferCompat:: on only the downstream code.

  • Splitting from (1) means you can sequence this change between (1) and (2) — code always builds.
  • Splitting from (3) means you can do a simple search/replace. If (2) is packed up with (3) it seems a lot more awkward, since you have to avoid accidentally undoing (3) during the search/replace. Then if somehow (3) gets reverted you need to start over when it relands.

Plausible, though a fair bit more churn - I'd probably be up for it, though.

  • Dave

(were there other regressions I mentioned/should think about?)

I don't have specific concerns; I was just reading between the lines of your description...

Fair - probably my own hedging there.

  1. Add using MemoryBufferCompat = MemoryBuffer and search/replace all static MemoryBuffer:: API calls over to MemoryBufferCompat::. No direct impact on downstreams.

Yeah, that's some of the extra churn I was thinking of/hoping to avoid - but yeah, it's probably worthwhile.

What's the benefit of having the extra step where everything's renamed twice? (if it's going to be a monolithic commit - as mentioned in (3)) Compared to doing the mass change while keeping the (1 & 2) pieces for backwards compatibility if needed?

Benefits I was seeing of splitting (1-3) up are:

  • makes (2) easy for downstreams to integrate separately from (1) and (3) (see below for why (2) is interesting)
  • prevents any reverts of (3) on main from resulting in churn in downstream efforts to migrate in response to (1-2)
  • enables (3) to NOT be monolithic -- still debatable how useful it is, but if split up then individual pieces can run through buildbots separately (and be reverted separately)
  1. Change MemoryBuffer to use Error and Expected, leaving behind std::error_code and ErrorOr wrappers in a no-longer-just-a-typedef MemoryBufferCompat. Easy for downstreams to temporarily revert, and cherry-pick once they've finished adapting in the example of (1).
  2. Update all code to use the new APIs. Could be done all at once since it's mostly mechanical, as you said. Also an option to split up by component (e.g., maybe the llvm-symbolizer regression is okay, but it could be nice to get that reviewed separately / in isolation).
  3. Delete MemoryBufferCompat. Downstreams can temporarily revert while they follow the example of on (3).

(Given that (1) and (2) are easy to write, you already have tree state for (4), and (3) is easy to create from (4), then I think you could construct the above commit sequence without having to redo all the work... then if you wanted to split (3) up from there it'd be easy enough.)

(2) seems like the commit mostly likely to cause functional regressions, and it'd be isolated and easy to revert/reapply (downstream and/or upstream) without much churn.

(3) would be more likely to cause regression? Presumably (2) is really uninteresting because it adds a new API (re-adding MemoryBuffer, but unused since everything's using MemoryBufferCompat) without any usage, yeah?

(2) changes all downstream clients of MemoryBuffer APIs from std::error_code to Error

  • Mostly will cause build failures
  • Also runtime regressions, due to auto, etc.

Oooh, right. Good point - thanks for walking me through it!

The fix is to do a search/replace of MemoryBuffer:: to MemoryBufferCompat:: on only the downstream code.

  • Splitting from (1) means you can sequence this change between (1) and (2) — code always builds.
  • Splitting from (3) means you can do a simple search/replace. If (2) is packed up with (3) it seems a lot more awkward, since you have to avoid accidentally undoing (3) during the search/replace. Then if somehow (3) gets reverted you need to start over when it relands.

Yeah, think I'm with you.

Given the amount of churn this means, though - reckon it's worth it? Reckon it needs more llvm-dev thread/buy-in/etc? Any other bike-shedding for MemoryBufferCompat? (MemoryBufferDeprecated? but I don't really mind)

Given the amount of churn this means, though - reckon it's worth it? Reckon it needs more llvm-dev thread/buy-in/etc?

I think the churn is worth since my intuition is that it has high value for downstreams/clients (in terms of mitigating risk/etc.). (For example, the Swift compiler also uses MemoryBuffer and uses auto quite heavily.)

Not sure if this needs more buy-in, but probably worth communicating the plan on llvm-dev. If nothing else, makes it easy for relevant commits to point to it on lists.llvm.org. Could even add a working sed -e or perl command for the renames?

Any other bike-shedding for MemoryBufferCompat? (MemoryBufferDeprecated? but I don't really mind)

  • MemoryBufferDeprecated SGTM (more descriptive than "compat"), MemoryBufferLegacy would work too
  • MemoryBufferErrorCodeAPI is even more descriptive -- see related idea below
  • could also rename all the (relevant) functions instead of the class... but since these are all static anyway renaming the class feels easier?

Thinking the names through gave me an idea for a refined staging:

  1. Add MemoryBufferErrorAPI (wrapping APIs with errorOrToExpected) and MemoryBufferErrorCodeAPI (alias for MemoryBuffer) in the same commit.
  2. Migrate in-tree callers to MemoryBufferErrorCodeAPI (via mass rename). (Could even move some to MemoryBufferErrorAPI?)
  3. Update MemoryBuffer to use Error/Expected APIs, change MemoryBufferErrorAPI to an alias of it, and leave behind MemoryBufferErrorCodeAPI (wrapping APIs with expectedToErrorOr).
  4. One or more commits:
    1. Migrate in-tree callers to MemoryBuffer.
    2. Delete MemoryBufferErrorAPI alias.
  5. Delete MemoryBufferErrorCodeAPI wrappers.

The refinement is that the new (1) is better designed for cherry-picking to a stable branch:

  • Isolated to MemoryBuffer (no changes to callers), making conflict resolution trivial.
  • Downstreams / clients can migrate to MemoryBufferError without integrating the change to the default behaviour of MemoryBuffer.
  • Particularly useful for clients that want to cherry-pick/merge changes between a branch that builds against ToT LLVM and one that builds against a "stable" version that predates the transition.

The new (3) and (5) are the same as the old (2) and (4) -- isolated changes that are easy to revert.

(But if you're not convinced, I think my previous idea for staging would still be a huge help.)

  1. One or more commits:
    1. Migrate in-tree callers to MemoryBuffer.
    2. Delete MemoryBufferErrorAPI alias.
  2. Delete MemoryBufferErrorCodeAPI wrappers.

(Potentially MemoryBufferErrorAPI and MemoryBufferErrorAPI could be kept across an LLVM release branch boundary to help LLVM clients that use those... not sure how useful that'd be?)

Thanks for the suggestions/details, @dexonsmith - I've posted to llvm-dev here: https://groups.google.com/g/llvm-dev/c/m9UVRhzJvh4/m/qdd_SyPuCQAJ and will wait for some follow-up (or dead silence) before starting along probably your latter suggestion.

Thanks for the suggestions/details, @dexonsmith - I've posted to llvm-dev here: https://groups.google.com/g/llvm-dev/c/m9UVRhzJvh4/m/qdd_SyPuCQAJ and will wait for some follow-up (or dead silence) before starting along probably your latter suggestion.

SGTM, thanks David!