This is an archive of the discontinued LLVM Phabricator instance.

[llvm-objcopy] Use getNumberOfSymbols() instead of getRawNumberOfSymbols()
AcceptedPublic

Authored by DaanDeMeyer on Dec 19 2022, 4:04 AM.

Details

Summary

getRawNumberOfSymbols() assumes that a symbol table exists, which isn't
always guaranteed, while getNumberOfSymbols() handles and tolerates objects
without a symbol table. When there is a symbol table, both methods return
the same value.

Also add a test to ensure we don't regress in this regard. The test
generates a basic COFF object with symbols and overrides the symbol table
pointer with zeros to craft the input required to verify llvm-objcopy works
as expected in this scenario.

Diff Detail

Event Timeline

DaanDeMeyer created this revision.Dec 19 2022, 4:04 AM
DaanDeMeyer requested review of this revision.Dec 19 2022, 4:04 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 19 2022, 4:04 AM
DaanDeMeyer edited the summary of this revision. (Show Details)Dec 19 2022, 4:04 AM
DaanDeMeyer added a reviewer: mstorsjo.
mstorsjo requested changes to this revision.Dec 19 2022, 4:16 AM

The first thing we do in the for loop is call getSymbol() which checks if the index is greater than or equal to getRawNumberOfSymbols() and returns an error if that's the case. Given the above, it makes sense to loop over getRawNumberOfSymbols() instead of over getNumberOfSymbols() since the value of the former may differ from the latter if the symbol table could not be found.

This sounds like an argument saying the existing code is correct, no?

This patch allows using llvm-objcopy to add sections to a aarch64 PE executable on x86-64. Before, it failed with an error indicating it failed to parse the PE binary.

What architecture you're running the tools on should really not matter at all here.

This patch is both lacking a proper explanation of what is going wrong currently and how this patch fixes it, and a testcase that could verify that the testcase doesn't regress later. Or even a link to an original bug report with the original issue so that someone else can verify and figure out what's going on.

From reading COFFObjectFile::getNumberOfSymbols() and COFFObjectFile::getRawNumberOfSymbols(), the only difference I see would be if we're operating on an object file without a symbol table - where the former nicely returns 0 while the latter hits llvm_unreachable? What's the actual situation you've got at hand here?

This revision now requires changes to proceed.Dec 19 2022, 4:16 AM
DaanDeMeyer edited the summary of this revision. (Show Details)Dec 19 2022, 4:45 AM
DaanDeMeyer added a comment.EditedDec 19 2022, 4:49 AM

This sounds like an argument saying the existing code is correct, no?

My bad, I switched the order of getRawNumberOfSymbols() and getNumberOfSymbols() around.

From reading COFFObjectFile::getNumberOfSymbols() and COFFObjectFile::getRawNumberOfSymbols(), the only difference I see would be if we're operating on an object file without a symbol table - where the former nicely returns 0 while the latter hits llvm_unreachable? What's the actual situation you've got at hand here?

When running llvm-objcopy built from source on an AARCH64 EFI executable built from source without this patch, I get the following error:

➜  llvm-project git:(main) build/bin/llvm-objcopy linuxaa64.efi.stub abc
build/bin/llvm-objcopy: error: 'linuxaa64.efi.stub': Invalid data was encountered while parsing the file

I looked into the error, and the root cause of the error is that getRawNumberOfSymbols() returns 1 while getNumberOfSymbols() returns 0, presumably due to a missing symbol table. Regardless of what's causing the missing symbol table, it seems prudent to not try to read symbols that don't exist, which is why I'm proposing the following patch.

We know that the loop will only succeed if getRawNumberOfSymbols() <= getNumberOfSymbols(), so unless we want the failure to happen, it seems better to loop over getNumberOfSymbols() rather than getRawNumberOfSymbols()

This sounds like an argument saying the existing code is correct, no?

My bad, I switched the order of getRawNumberOfSymbols() and getNumberOfSymbols() around.

The summary still mentions "if the index is greater than or equal to getRawNumberOfSymbols()" which I also presume should reference the other function?

From reading COFFObjectFile::getNumberOfSymbols() and COFFObjectFile::getRawNumberOfSymbols(), the only difference I see would be if we're operating on an object file without a symbol table - where the former nicely returns 0 while the latter hits llvm_unreachable? What's the actual situation you've got at hand here?

When running llvm-objcopy built from source on an AARCH64 EFI executable built from source without this patch, I get the following error:

➜  llvm-project git:(main) build/bin/llvm-objcopy linuxaa64.efi.stub abc
build/bin/llvm-objcopy: error: 'linuxaa64.efi.stub': Invalid data was encountered while parsing the file

I looked into the error, and the root cause of the error is that getRawNumberOfSymbols() returns 1 while getNumberOfSymbols() returns 0, presumably due to a missing symbol table. Regardless of what's causing the missing symbol table, it seems prudent to not try to read symbols that don't exist, which is why I'm proposing the following patch.

We know that the loop will only succeed if getRawNumberOfSymbols() <= getNumberOfSymbols(), so unless we want the failure to happen, it seems better to loop over getNumberOfSymbols() rather than getRawNumberOfSymbols()

Well the reasoning is slightly off here. If the symbol table is missing, getRawNumberOfSymbols() doesn't return 1, it should hit a llvm_unreachable, which means the compiler didn't even necessarily generate code for that case, so it could literally crash, return anything, etc. In a build with asserts enabled, it would be a hard fault directly at that point.

Anyway, the patch probably is correct that this should be changed into getNumberOfSymbols(). The current reasoning in the patch is quite hard to follow though - something like this would be more straightforward:

getRawNumberOfSymbols() assumes that a symbol table does exist - while getNumberOfSymbols() does handle and tolerate the case when there's no symbol table at all. (For the cases when there is a symbol table, they do return the same value.)

That makes it much clearer - it's not about iterating over a bigger or smaller range of values, it's about the fact that we can't call getRawNumberOfSymbols() when there's no symbol table.

But it would be great to have a real live source binary to inspect - can you share one? Also the change _does_ need a testcase added/changed that triggers this case. After building, you can run e.g. bin/llvm-lit -s path/to/llvm-project/llvm/test/tools/llvm-objcopy which will run all the testcases. Most of these testcases run with minimal synthetic input generated from yaml files though - we don't generally want to commit full real binaries there.

So to generate a testcase, you need to look into whether a binary can be generated with LLVM's yaml2obj, place a minimal such input in llvm/test/tools/llvm-objcopy/COFF/Inputs and write a test file in llvm/test/tools/llvm-objcopy/COFF which uses that, which would trigger the issue before the fix is applied, and which runs correctly with the fix in place.

Here's the offending file:

If we look at the PE header, we can see that the symbol table is 0x0000 while the symbol count is 1. This executable is systemd's EFI stub built on aarch64 using gnu-efi (build file can be found here: https://github.com/systemd/systemd/blob/main/src/boot/efi/meson.build).

➜  llvm-project git:(coff-symbols) xxd linuxaa64.efi.stub | head
00000000: 4d5a 0000 0000 0000 0000 0000 0000 0000  MZ..............
00000010: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000020: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000030: 0000 0000 0000 0000 0000 0000 4000 0000  ............@...
00000040: 5045 0000 64aa 0200 0000 0000 0000 0000  PE..d...........
00000050: 0100 0000 a000 0602 0b02 0214 00c0 0000  ................
00000060: 001e 0000 0000 0000 0010 0000 0010 0000  ................
00000070: 0000 0000 0000 0000 0010 0000 0002 0000  ................
00000080: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000090: 00ee 0000 0010 0000 0000 0000 0a00 0000  ................

I'll address the rest of the issues later today. Thanks for taking the time to review!

DaanDeMeyer edited the summary of this revision. (Show Details)

Updated summary and added a test. The test fails on the main branch and passes with the changes made in this diff.

Thanks for the update - this mostly looks good to me. Using python to overwrite parts of the object file looks neat! Unfortunately, it seems to have failed in the Windows pre-merge tests. The output from that seems to be this:

FAIL: LLVM :: tools/llvm-objcopy/COFF/no-symbol-table.test (43164 of 45770)
******************** TEST 'LLVM :: tools/llvm-objcopy/COFF/no-symbol-table.test' FAILED ********************
Script:
--
: 'RUN: at line 4';   c:\ws\w3\llvm-project\premerge-checks\build\bin\yaml2obj.exe C:\ws\w3\llvm-project\premerge-checks\llvm\test\tools\llvm-objcopy\COFF/Inputs/no-symbol-table.yaml -o C:\ws\w3\llvm-project\premerge-checks\build\test\tools\llvm-objcopy\COFF\Output\no-symbol-table.test.tmp.obj
: 'RUN: at line 7';   "C:\Python39\python.exe" -c "with open('C:\ws\w3\llvm-project\premerge-checks\build\test\tools\llvm-objcopy\COFF\Output\no-symbol-table.test.tmp.obj', 'r+b') as input: input.seek(8); input.write(b'\x00' * 4)"
: 'RUN: at line 9';   c:\ws\w3\llvm-project\premerge-checks\build\bin\llvm-objcopy.exe C:\ws\w3\llvm-project\premerge-checks\build\test\tools\llvm-objcopy\COFF\Output\no-symbol-table.test.tmp.obj
--
Exit Code: 1

Command Output (stdout):
--
$ ":" "RUN: at line 4"
$ "c:\ws\w3\llvm-project\premerge-checks\build\bin\yaml2obj.exe" "C:\ws\w3\llvm-project\premerge-checks\llvm\test\tools\llvm-objcopy\COFF/Inputs/no-symbol-table.yaml" "-o" "C:\ws\w3\llvm-project\premerge-checks\build\test\tools\llvm-objcopy\COFF\Output\no-symbol-table.test.tmp.obj"
$ ":" "RUN: at line 7"
$ "C:\Python39\python.exe" "-c" "with open('C:\ws\w3\llvm-project\premerge-checks\build\test\tools\llvm-objcopy\COFF\Output\no-symbol-table.test.tmp.obj', 'r+b') as input: input.seek(8); input.write(b'\x00' * 4)"
# command stderr:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
OSError: [Errno 22] Invalid argument: 'C:\\ws\\w3\\llvm-project\\premerge-checks\x08uild\test\tools\\llvm-objcopy\\COFF\\Output\no-symbol-table.test.tmp.obj'

error: command failed with exit status: 1

--

********************

A common comment on testcases like this, is that it's bit odd to have a test that only runs a command but doesn't have any actual checks afterwards. Maybe a simple check to see that the output has zero symbols would add some more functional aspect to the test too?

Added a few checks to the test and modified the python open() command to use a raw string to hopefully fix the test on Windows

mstorsjo accepted this revision.Jan 2 2023, 4:06 AM

Thanks, this LGTM now!

Do you have commit access, or do you want me to commit it for you? In the latter case, what's your preferred git author line for the commit, i.e. Real Name <email@address>?

This revision is now accepted and ready to land.Jan 2 2023, 4:06 AM

I have commit access and just pushed the commit to the main branch. Thanks for taking the time to review!

I have commit access and just pushed the commit to the main branch. Thanks for taking the time to review!

When pushing commits, remember to add the Differential Revision: line, to allow mapping from the commit to the review, and for automatically closing the review.

MaskRay added inline comments.Jan 2 2023, 11:37 AM
llvm/test/tools/llvm-objcopy/COFF/Inputs/no-symbol-table.yaml
1

This should be moved to no-symbol-table.test to avoid an extra file.