This is an archive of the discontinued LLVM Phabricator instance.

[symbolizer] Check existence of input file in GNU mode
ClosedPublic

Authored by sepavloff on Apr 5 2023, 11:58 AM.

Details

Summary

GNU addr2line exits immediately if it cannot open the file specified as
executable/relocatable. In contrast llvm-addr2line does not exit and, if
addresses are not specified in command line, waits for input on stdin. This
causes the test compiler-rt/test/asan/TestCases/Posix/asan-symbolize-bad-path.cc to block
forever on Gentoo (see https://reviews.llvm.org/rG27c4777f41d2ab204c1cf84ff1cccd5ba41354da#1190273).
To fix this issue the behavior llvm-addr2line now exits if
executable/relocatable file cannot be opened.

It fixes https://github.com/llvm/llvm-project/issues/42099 (llvm-addr2line
does not exit when passed a non-existent file).

Diff Detail

Event Timeline

sepavloff created this revision.Apr 5 2023, 11:58 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 5 2023, 11:58 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
sepavloff requested review of this revision.Apr 5 2023, 11:58 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 5 2023, 11:58 AM
jhenderson requested changes to this revision.Apr 6 2023, 12:14 AM

Thanks for the patch! I'm not convinced you've made the right change in the right place, and there are a number of other issues with this patch, which need to be addressed. See my inline comments for details.

llvm/lib/DebugInfo/Symbolize/Symbolize.cpp
715

Can this instead return an Error to be handled up the call stack? An Error contains as much information as a std::string/bool return pairing like you're doing.

Doing this means the function boils down to:

Error LLVMSymbolizer::findBinaryFile(StringRef BinName) {
  Expected<SymbolizableModule *> BinFile = getOrCreateModuleInfo(BinName.str());
  return BinFile ? Error::success() : BinFile.takeError();
}

That then raises the question: is the place you're currently calling this the right place? I suspect not, and that you should instead by failing and reporting the error around the point where getOrCreateModuleInfo is called "for real".

llvm/tools/llvm-symbolizer/llvm-symbolizer.cpp
483–484

Does llvm-addr2line already assume a.out is the input file if no input file is specified? If not, this seems like it's incorrect behaviour.

I was under the impression that if no input file is specified, it entered "interactive" mode.

487

This is a different style to how other warnings and errors are printed for llvm-symbolizer (see above e.g. line 467 in this patch).

It seems to me like a follow-up change to normalise the printing format for all warnings and errors to use WithColor, probably within reportWarning/reportError functions is in order. Are you happy to make that patch?

This revision now requires changes to proceed.Apr 6 2023, 12:14 AM
mgorny added a comment.Apr 6 2023, 7:27 AM

Do you need me to test it (it's some heavy building after all) or have you reproduced the problem? If the latter, I'd prefer to do full build of the new snapshot once the patch lands.

sepavloff added inline comments.Apr 7 2023, 1:47 AM
llvm/tools/llvm-symbolizer/llvm-symbolizer.cpp
483–484

llvm-addr2line did not assume a.out if input file was not specified. Instead it followed logic of llvm-symbolizer and expected binary file name in the input stream, provided that addresses were not specified in command line also.

I was under the impression that if no input file is specified, it entered "interactive" mode.

llvm-symbolizer and addr2line enter "interactive" mode if addresses are not specified in command line. If binary file is not specified as well, llvm-symbolizer tries reading both binary file and addresses. addr2line never reads binary file from input stream.

487

If llvm-addr2line were given wrong binary file, it printed message:

$ llvm-addr2line -e inexistent 0x12
LLVMSymbolizer: error reading file: No such file or directory
??:0

Output of GNU addr2line is different:

$ addr2line -e inexistent 0x12
addr2line: 'inexistent': No such file

With this patch output of llvm-addr2line is closer to GNU utility:

$ llvm-addr2line -e inexistent 0x12
llvm-addr2line: 'inexistent': No such file or directory

Actually output of llvm-symbolizer is not consistent. LLVMSymbolizer is a class name, it does not represent utility. Such message banner is OK for debug dumps, but for a user it is inconvenient.

It seems to me like a follow-up change to normalise the printing format for all warnings and errors to use WithColor, probably within reportWarning/reportError functions is in order. Are you happy to make that patch?

The main problem with such patch is decision about message format. It GNU-compatible message would be used, which part of the message llvm-addr2line: 'inexistent': No such file or directory would be colored red? Should the message contain text error:? Can it break some user scenarios due to changed output? Technically such patch looks simple, I could make it.

sepavloff updated this revision to Diff 511641.Apr 7 2023, 1:49 AM

Use Error as return value of findBinaryFile, add test for error message

It seems to me like there are 4(?) different issues that have come up here, and they probably need splitting into separate patches, and possibly even an RFC or two. I think the separate issues are:

  1. GNU addr2line defaults to reading from a.out if no input file is specified.
  2. If the input file can't be found, llvm-addr2line and llvm-symbolizer both enter interactive mode (but any addresses without an accompanying file will result in an error), whereas addr2line emits an error and exits.
  3. The error message returned (in interactive mode) from llvm-addr2line does not match the GNU addr2line error message.
  4. The error message is printed directly within the LLVMSymbolizer library, so has weird formatting etc. More generally, the errors printed by llvm-symbolizer are inconsistently formatted.

Feel free to correct any of the above or add more.

My opinions on the topics are:

  1. It should be easy for llvm-addr2line to default to a value (specifically "a.out") for the "-e" command-line arg, if no input file is explicitly specified. It should be possible to implement it in such a way that the rest of the code doesn't require any changes for this to just work. I don't think we want to change llvm-symbolizer behaviour here, and it's a big enough difference that the llvm-addr2line docs should be updated to note the difference.
  2. IMHO, if an explicitly specified input file does not exist (at the time of command-line parsing at least), I could get behind an error message being reported by BOTH llvm-symbolizer and llvm-addr2line during command-line parsing (or more likely, slightly after, but before the business logic of the symbolizer code kicks in). We'd still need to check for file existence as in the current library code, in case the file has been deleted since command-line parsing happened. The early check should just be a file existence check (don't try to open the file), and should exit the program in the event of failure. It might be worth raising this issue on the mailing lists to confirm changing the llvm-symbolizer behaviour too is okay with others. My reasoning for why this should impact both tools is that a user has requested an explicit file, so if it doesn't exist, trying to use it doesn't make sense, but I could see theoretical use cases where the file is created after the opening of the program in interactive mode (so hence the need for discussion).
  3. Re. error message format, I don't think an exact match of error message format should be necessary. That being said, I do think we should broadly match the error style emitted by our other binutils replacements (and certainly messages from the same tool should be formatted the same way). This goes for both llvm-symbolizer and llvm-addr2line. For example, if the input file is missing for llvm-nm, the following error is emitted on my Windows machine: "D:\llvm\build\Debug\bin\llvm-nm.exe: error: missing.bin: no such file or directory". The "error" is in red. Given we already deviate (slightly) from the GNU tools in the exact message format (GNU tools, apart from readelf, don't print the "error:" part), I think normalising with the rest of our tools makes more sense. Where there's a clear message from the OS, we should print that (along with any appropriate additional context). Where there isn't (e.g. you report an error on a missing file, without having attempted to actually open that file), it doesn't hurt to match the GNU message content, if the GNU one is sensible.
  4. My opinions on library error reporting are documented in this lightning talk: https://www.youtube.com/watch?v=YSEY4pg1YB0. It's probably a non-trivial task (albeit hopefully not too complex), but in my opinion, this is a perfect case of where a callback should be used to report failures, rather than for the library to try to handle them directly.

Would it be possible to revert the original changes then reland them with the improvements mentioned above?

Not trying to be difficult at all, it just affects our continuous downstream testing in Gentoo.

Would it be possible to revert the original changes then reland them with the improvements mentioned above?

Not trying to be difficult at all, it just affects our continuous downstream testing in Gentoo.

I've got no objections to that.

  1. IMHO, if an explicitly specified input file does not exist (at the time of command-line parsing at least), I could get behind an error message being reported by BOTH llvm-symbolizer and llvm-addr2line during command-line parsing (or more likely, slightly after, but before the business logic of the symbolizer code kicks in). We'd still need to check for file existence as in the current library code, in case the file has been deleted since command-line parsing happened. The early check should just be a file existence check (don't try to open the file), and should exit the program in the event of failure. It might be worth raising this issue on the mailing lists to confirm changing the llvm-symbolizer behaviour too is okay with others. My reasoning for why this should impact both tools is that a user has requested an explicit file, so if it doesn't exist, trying to use it doesn't make sense, but I could see theoretical use cases where the file is created after the opening of the program in interactive mode (so hence the need for discussion).

Both llvm-symbolizer and GNU addr2line report error if an explicitly specified input file does not exist. They differ in what they do after. The difference can be shown in such invocation:

addr2line -e inexistent_file

Here no addresses are specified in command line and both utilities would read them from input stream. However addr2line finds that the source file does not exists and exits immediately, no attempt to read imput stream is made. It is reasonable because -e is the only way to provide input file. llvm-symbolize in this case tries to interpret addresses from input stream or command line. It does not make sense because in this case because in this case llvm-symbolizer cannot override binary file.

It looks like llvm-symbolizer also should check input file existence early. It need a separate patch, maybe someone uses such unusual mode.

When llvm-symbolizer check existence of binary file, it creates its representation in memory. Even if the file then disappears or is changed, that does not affect decoding process, with uses the cached representation.

sepavloff updated this revision to Diff 513500.Apr 14 2023, 2:25 AM

Rebased, do not handle default a.out

It looks like llvm-symbolizer also should check input file existence early. It need a separate patch, maybe someone uses such unusual mode.

Right, my points above were all intended to be separate patches (potentially).

When llvm-symbolizer check existence of binary file, it creates its representation in memory. Even if the file then disappears or is changed, that does not affect decoding process, with uses the cached representation.

I was thinking the early check could just be a "file exists" check, and not to try to open it until later when it's actually used. That would fit better into how the code currently works.

When llvm-symbolizer check existence of binary file, it creates its representation in memory. Even if the file then disappears or is changed, that does not affect decoding process, with uses the cached representation.

I was thinking the early check could just be a "file exists" check, and not to try to open it until later when it's actually used. That would fit better into how the code currently works.

It would not solve the problem in full. A file can exists but be inaccessible for user or have incorrect format. GNU addr2line in this case exits with error but llvm-addr2line waits for input addresses. It is safer to exit early because no useful job can be done if input file is wrong. To implement the safer solution, the check for existence should be made prior to the loop that waits input from stdin.

This complexity is caused by the fact that in native mode llvm-symbolizer can read binary file name from input stream. In this mode wrong input file is not a reason to stop working - the next data in stdin can query about different file. This is why llvm-symbolizer opens input file late, after it reads input stream.

To support both modes of llvm-symbolizer the existence check should be made in two places, prior to reading standard input and in the reading loop, depending on the used mode.

This complexity is caused by the fact that in native mode llvm-symbolizer can read binary file name from input stream. In this mode wrong input file is not a reason to stop working - the next data in stdin can query about different file. This is why llvm-symbolizer opens input file late, after it reads input stream.

Perhaps the APIs could be restructured such that the file is opened, then queries run on it - then in the addr2line mode, we could open the file once, then parse and run queries, and in the llvm-symbolizer mode we could parse, open file, run query?
(presumably more invasive refactoring than the current situation, but it would be good to actually open the file that gets used for the queries, rather than doing a separate file existence/open check)

This complexity is caused by the fact that in native mode llvm-symbolizer can read binary file name from input stream. In this mode wrong input file is not a reason to stop working - the next data in stdin can query about different file. This is why llvm-symbolizer opens input file late, after it reads input stream.

Perhaps the APIs could be restructured such that the file is opened, then queries run on it - then in the addr2line mode, we could open the file once, then parse and run queries, and in the llvm-symbolizer mode we could parse, open file, run query?

Actually llvm-symbolizer works in this way with this patch. When input file is tested for existence, findBinaryFile calls getOrCreateModuleInfo, which stores the file information in LLVMSymbolizer::Modules. When later on getOrCreateModuleInfo is called again in the address translation loop, the file information is taken from the cache. So the call to findBinaryFile works as a module prefetch. No performance or memory penalty.

This complexity is caused by the fact that in native mode llvm-symbolizer can read binary file name from input stream. In this mode wrong input file is not a reason to stop working - the next data in stdin can query about different file. This is why llvm-symbolizer opens input file late, after it reads input stream.

Perhaps the APIs could be restructured such that the file is opened, then queries run on it - then in the addr2line mode, we could open the file once, then parse and run queries, and in the llvm-symbolizer mode we could parse, open file, run query?

Actually llvm-symbolizer works in this way with this patch. When input file is tested for existence, findBinaryFile calls getOrCreateModuleInfo, which stores the file information in LLVMSymbolizer::Modules. When later on getOrCreateModuleInfo is called again in the address translation loop, the file information is taken from the cache. So the call to findBinaryFile works as a module prefetch. No performance or memory penalty.

Ah, fair enough - the indirection's a bit awkward (spooky action at a distance, etc) - but at least satisfies me on the "don't do two correlated file options/expect one result to haev any meaning for a later independent operation".

sepavloff updated this revision to Diff 526007.May 26 2023, 3:17 AM

Check only file existence when checking command line argument

Apologies for the delay in coming back to this: I've been caught up with several weeks worth of high priority work with my job, and haven't had the time to look at LLVM reviews beyond a quick glance.

It looks like llvm-symbolizer also should check input file existence early. It need a separate patch, maybe someone uses such unusual mode.

Unless the changes are significant, I don't see a reason to limit this change to llvm-addr2line only. If someone has specified -e, they won't ever be able to get useful output from llvm-symbolizer.

I'm okay with the direction @dblaikie has suggested ultimately, namely that we restructure the code so that we open the file early, as long as we keep the file open thereafter (so as not to open the file twice). However, I think it could be a separate subsequent patch, so that we can get this immediate improvement added.

By the way, there is prior art for existence checking long before the file is needed. Specifically, LLD checks that certain files exist early on, so that it doesn't need the relatively slow work of the link before discovering that the file doesn't exist or the directory is invalid in some manner etc.

llvm/lib/DebugInfo/Symbolize/Symbolize.cpp
566

Nit: have you run clang-format on this?

571–573

We have C++17 structured bindings now.

712

There's no need for doxygen comments in a .cpp file like this, as far as I know. Regardless, the comment is a bit stale: you're not opening the file here.

717

As above re. structured bindings.

llvm/test/tools/llvm-symbolizer/errors.test
2

"Inexistent" isn't a word - you probably mean "nonexistent".

It would be good to have a test case that tries to open a directory.

llvm/tools/llvm-symbolizer/llvm-symbolizer.cpp
487

My strong preference is that libraries pass their error messages back up to the tool that invokes them, in one form or other (e.g. via a callback or an Error/Expected). As you say, end users don't care about the internal details like class names, so these shouldn't be in error messages.

I don't think we aim for direct GNU compatibility with error messages, in general, though a leading toolname is considered important. There is already helpful functionality that handles all of this though. See https://github.com/llvm/llvm-project/blob/d1ef99fe1ce3e2996c317600a1398933b9078260/llvm/tools/llvm-objdump/llvm-objdump.cpp#L320 for an example (specifically, look how WithColor takes a tool name as an argument).

Updated patch

sepavloff marked 5 inline comments as done.Jun 18 2023, 1:27 AM

It looks like llvm-symbolizer also should check input file existence early. It need a separate patch, maybe someone uses such unusual mode.

Unless the changes are significant, I don't see a reason to limit this change to llvm-addr2line only. If someone has specified -e, they won't ever be able to get useful output from llvm-symbolizer.

The patch is in D153219. It is not large, but there are several tests that expect processing addresses in the case of nonexisting binary files. It is not clear if these tests were added for better coverage or there are real cases for such usage. In the latter case it is better to have a separate patch.

llvm/tools/llvm-symbolizer/llvm-symbolizer.cpp
487

The current implementation of error handling in llvm-symbolizer is very close to your preferences. Error messages are printed in printError(const ErrorInfoBase &EI, StringRef Path), which is also passed as a callback to Printer classes. This function is very similar to objdump::reportError with one exception: it does not call exit. llvm-symbolizer can get things like binary file path from input stream. In such case immediate exit, if the file does not exist, is not suitable, because the next portion of data can provide valid binary file. So exit is called when it is clear that no useful work can be made.

jhenderson added inline comments.Jun 19 2023, 1:09 AM
llvm/lib/DebugInfo/Symbolize/Symbolize.cpp
713

Perhaps checkFileExists or checkIsFile would be a better name? If you use one of those names, you can delete the comment entirely, as it doesn't do anything useful then.

llvm/tools/llvm-symbolizer/llvm-symbolizer.cpp
486

It doesn't need to be part of this change, but I don't think it would be unreasonable to add a printError overload that takes Error instead of ErrorInfoBase and does the handleAllErrors internally. It looks like there are at least 2 separate places now that would benefit from that.

Changed function name: findBinaryFile -> checkFileExists

sepavloff marked an inline comment as done.Jun 20 2023, 11:33 AM
sepavloff added inline comments.
llvm/lib/DebugInfo/Symbolize/Symbolize.cpp
713

Indeed, thank you!

llvm/tools/llvm-symbolizer/llvm-symbolizer.cpp
486

Yes, it makes sense.

jhenderson accepted this revision.Jun 21 2023, 1:22 AM

LGTM, but would be good to get a second approval from someone else already involved in this patch before landing it.

This revision is now accepted and ready to land.Jun 21 2023, 1:22 AM
MaskRay accepted this revision.Jun 21 2023, 10:47 PM

There is a time-of-check-to-time-of-use issue that we can ignore (as we may wait on the addresses), but otherwise the new behavior (also GNU's) seems to be what a casual user may expect.

llvm/include/llvm/DebugInfo/Symbolize/Symbolize.h
193
llvm/lib/DebugInfo/Symbolize/Symbolize.cpp
714

fold the two if into one with &&

There is a time-of-check-to-time-of-use issue that we can ignore (as we may wait on the addresses),

I had the same concern there (& maybe @jhenderson did too) & claimed to be addressed/not a problem in https://reviews.llvm.org/D147652#4314914 but I didn't quite follow.

What do you mean "as we may wait on the address" - like we could deal with this problem later? Any particular reason to do that, rather than avoiding introducing the time-of-check-to-time-of-use issue in the first place?

There is a time-of-check-to-time-of-use issue that we can ignore (as we may wait on the addresses),

I had the same concern there (& maybe @jhenderson did too) & claimed to be addressed/not a problem in https://reviews.llvm.org/D147652#4314914 but I didn't quite follow.

What do you mean "as we may wait on the address" - like we could deal with this problem later? Any particular reason to do that, rather than avoiding introducing the time-of-check-to-time-of-use issue in the first place?

Apologies, my stamp was a bit hand-wavy. I saw your commend and investigated a bit more. I think D153595 may be the direction?
I don't touch llvm-symbolizer as @sepavloff can do that in D153219.

This comment was removed by jhenderson.

There is a time-of-check-to-time-of-use issue that we can ignore (as we may wait on the addresses),

I had the same concern there (& maybe @jhenderson did too) & claimed to be addressed/not a problem in https://reviews.llvm.org/D147652#4314914 but I didn't quite follow.

What do you mean "as we may wait on the address" - like we could deal with this problem later? Any particular reason to do that, rather than avoiding introducing the time-of-check-to-time-of-use issue in the first place?

Apologies, this might be all caused by me (and I figured given nobody else really pushed against it, others were fine with it). Specifically this comment:

IMHO, if an explicitly specified input file does not exist (at the time of command-line parsing at least), I could get behind an error message being reported by BOTH llvm-symbolizer and llvm-addr2line during command-line parsing (or more likely, slightly after, but before the business logic of the symbolizer code kicks in). We'd still need to check for file existence as in the current library code, in case the file has been deleted since command-line parsing happened. The early check should just be a file existence check (don't try to open the file), and should exit the program in the event of failure. It might be worth raising this issue on the mailing lists to confirm changing the llvm-symbolizer behaviour too is okay with others. My reasoning for why this should impact both tools is that a user has requested an explicit file, so if it doesn't exist, trying to use it doesn't make sense, but I could see theoretical use cases where the file is created after the opening of the program in interactive mode (so hence the need for discussion).

It was my suggestion to have an early file existence check, which didn't attempt to open the file properly, on the basis that the later code would still detect the issue at the time of use. We could then restructure the code later as @dblaikie had suggested. The early check is not too dissimilar to, I believe, a few instances in LLD, which provide early feedback on input options (also highlighted above). I don't recall why I didn't just go with the opening of the file at this point, rather than later (reading my previous comments, it was implied somewhere that it wasn't straightforward to open the file upfront for some reason).

This revision was automatically updated to reflect the committed changes.
sepavloff marked an inline comment as done.