This is an archive of the discontinued LLVM Phabricator instance.

[MC] Adjust StringTableBuilder for linked Mach-O binaries
ClosedPublic

Authored by alexander-shaposhnikov on Oct 16 2020, 9:50 AM.

Details

Summary

LD64 emits string tables which start with a space and a zero byte.
This diff adjusts StringTableBuilder for linked Mach-O binaries to match LD64's behavior.
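
To make the layout concrete, here is a minimal C++ sketch of a string table that starts with a space and a zero byte. It is not the patch itself and does not use LLVM's StringTableBuilder API: `LinkedStringTable` and `buildLinkedStringTable` are hypothetical names, and the simple deduplication is only an assumption about what a builder would want.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical illustration, not LLVM's StringTableBuilder.
struct LinkedStringTable {
  std::string bytes;                                  // raw table contents
  std::unordered_map<std::string, uint32_t> offsets;  // name -> n_strx
};

LinkedStringTable buildLinkedStringTable(const std::vector<std::string> &names) {
  LinkedStringTable table;
  table.bytes.push_back(' ');   // offset 0: burned, never a valid name offset
  table.bytes.push_back('\0');  // offset 1: always an empty string
  for (const std::string &name : names) {
    if (table.offsets.count(name))
      continue;  // already stored; reuse the existing offset
    table.offsets[name] = static_cast<uint32_t>(table.bytes.size());
    table.bytes += name;
    table.bytes.push_back('\0');
  }
  // n_strx is a 32-bit field, so the table must stay addressable by it.
  assert(table.bytes.size() <= UINT32_MAX);
  return table;
}
```

With this layout the first real symbol name lands at offset 2, and no symbol is ever assigned offset 0.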

Test plan: make check-all

Diff Detail

Event Timeline

alexander-shaposhnikov created this object with visibility "All Users".
Herald added a project: Restricted Project.
Herald added a subscriber: hiraditya.
alexander-shaposhnikov requested review of this revision. Oct 16 2020, 9:50 AM
mtrent added inline comments. Oct 16 2020, 11:04 AM
llvm/lib/MC/StringTableBuilder.cpp
171

This is only appropriate for 64-bit architectures. 32-bit architectures, such as arm64_32, should continue to use 4 byte alignment.
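
A hedged sketch of the distinction raised in this comment: pad the finalized table to a 4-byte boundary for 32-bit targets and to a wider boundary for 64-bit ones. The 8-byte figure for 64-bit is an assumption (pointer size), since the thread only states the 4-byte case, and `paddedStringTableSize` is a hypothetical helper rather than anything in StringTableBuilder.cpp.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: round the string table size up to the alignment
// used for the target. 4 bytes for 32-bit (e.g. arm64_32) comes from the
// review comment; 8 bytes for 64-bit is an assumption.
uint64_t paddedStringTableSize(uint64_t size, bool is64Bit) {
  const uint64_t align = is64Bit ? 8 : 4;
  assert(size <= UINT64_MAX - (align - 1));  // guard against overflow
  return (size + align - 1) / align * align;
}
```

For example, a 10-byte table would be padded to 12 bytes on a 32-bit target and to 16 bytes on a 64-bit one under these assumptions.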

Add more tests, 32-bit/64-bit

mtrent accepted this revision. Oct 17 2020, 4:46 PM

Looks good to me!

This revision is now accepted and ready to land. Oct 17 2020, 4:46 PM

@mtrent, do you have the historical context for why ld64 adds a leading space to the string table? The relevant lines from its code are:

// burn first byte of string pool (so zero is never a valid string offset)
_currentBuffer[_currentBufferUsed++] = ' ';
// make offset 1 always point to an empty string
_currentBuffer[_currentBufferUsed++] = '\0';

However, the Mach-O documentation says:

A union that holds an index into the string table, n_strx. To specify an empty string (""), set this value to 0.

which suggests that the first byte of the string table should be the 0 byte and not the space.

mach-o/nlist.h is a bit more ambiguous. It says:

/*
 * Symbols with a index into the string table of zero (n_un.n_strx == 0) are
 * defined to have a null, "", name.  Therefore all string indexes to non null
 * names must not have a zero string index.  This is bit historical information
 * that has never been well documented.
 */

which could be interpreted as either "an n_strx of 0 should mean the empty string" (as above), or "an n_strx of 0 will always be interpreted as the empty string regardless of the string table's contents" (in which case ld64 is free to put what it wants as the first byte of the string table).

Is ld64 trying to reserve an n_strx of 0 as invalid in some way and distinguish it from an n_strx of 1 meaning the empty string?

smeenai added a subscriber: int3. Oct 19 2020, 7:31 PM
alexander-shaposhnikov changed the visibility from "All Users" to "Public (No Login Required)". Oct 20 2020, 8:07 PM

@mtrent, do you have the historical context for why ld64 adds a leading space to the string table?

Yes.

The actual details are locked in an office in a building I don't have access to because of the global pandemic. But I can give you the gist with citations. I'm also not the right person to ask, but I'm the person most likely to answer this question, so, hey, maybe I am!

However, the Mach-O documentation says:

A union that holds an index into the string table, n_strx. To specify an empty string (""), set this value to 0.

The Mach-O file format descends in part from the UNIX a.out file format. Both (the now defunct) ld and ld64 agree on the historical definition of n_strx as defined in the historical a.out(5) manpage. I believe the original BSD 4.2, 4.3, 4.4 man pages had this definition, and here is where I have to get a little hand-wavy. But this link from the Internet has text that matches my memory of that historical man page content (and note that this differs from newer versions of the a.out man page):

http://man.cat-v.org/unix_8th/5/a.out

In the a.out file a symbol's n_un.n_strx field gives an
index into the string table.  A n_strx value of 0 indicates
that no name is associated with a particular symbol table
entry.  The field n_un.n_name can be used to refer to the
symbol name only if the program sets this up using n_strx
and appropriate data from the string table.

Let me call your attention to this sentence:

A n_strx value of 0 indicates that no name is associated with a particular symbol table entry.

A strict reading of this sentence is: If you encounter a n_strx value of 0, there is no string for this value, therefore do not consult the strings table.

Kevin's ld linker agreed with this position, a n_strx == 0 means don't look at the string table, but as a matter of mercy also wrote a \0 byte at the front of the string table, in case someone accidentally dereferenced the string table anyway. If they accidentally tried to read a string from index 0 they'd get back a null terminated string, instead of crashing or getting whatever string happened to be the first item in the string table.

Nick's ld64 linker agreed with this position, a n_strx == 0 means don't look at the string table, but thought "if you get a null string back, that might not be a strong enough indication that you are reading an illegal value from the string table." Consider the case where nm prints an address and then prints the name of the symbol at that address. If you get back "" no name will be printed so there's no indication something went wrong. But if you got back something that wasn't "" you have a chance of noticing something bad has happened. So instead of writing a \0 byte at the front of the string table, ld64 writes a " \0" string at the front of the string table. Then everyone who writes code that walks the nlist can have a unit test that tests for strings that say " " and then fail their test: you're not allowed to look at the string table if n_strx is 0, so if your symbol name is " " you have a bug in your program. Of course, no one actually writes this test, and every Mach-O binary has an extra byte in it. But that's what it's there for. Why Nick chose ' ' and not a visible character that isn't a valid symbol name like '*' or '~' I have no idea.

The thing both ld and ld64 have in common is that both agree the value of a string at index 0 in the string table is undefined. And that's the most important takeaway. ld64 chooses to write a non-trivial string at this location to help you find your bugs. ld chooses to write an empty string at this location out of a sense of mercy. Neither is wrong. You could write a linker that stores "YOU HAVE A BUG IN YOUR PROGRAM" at the start of the strings table and your linker would not be wrong.
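
For readers who walk the nlist themselves, the rule above can be sketched as a lookup that never touches the string table when n_strx is 0. This is only an illustration under assumptions: `symbolName` is a hypothetical helper, not part of ld, ld64, or LLVM, and returning a null pointer for "no name" is one possible convention.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical reader-side helper. Returns nullptr when the symbol has no
// name (n_strx == 0) or when the offset is malformed; otherwise returns a
// pointer to the NUL-terminated name inside the string table.
const char *symbolName(const char *strtab, uint32_t strsize, uint32_t n_strx) {
  assert(strtab != nullptr);
  if (n_strx == 0)
    return nullptr;  // no name: do not consult the string table at all
  if (n_strx >= strsize)
    return nullptr;  // offset points past the end of the table
  // Make sure the name is terminated inside the table before handing it out.
  if (std::memchr(strtab + n_strx, '\0', strsize - n_strx) == nullptr)
    return nullptr;
  return strtab + n_strx;
}
```

A reader written this way behaves the same whether the table begins with "\0", " \0", or anything else, which is the point of treating the contents at index 0 as undefined.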

which suggests that the first byte of the string table should be the 0 byte and not the space.

I'm not familiar with the documentation you are citing. It sounds like it's describing the implications of the original ld behavior, but not its reasoning.

mach-o/nlist.h is a bit more ambiguous. It says:

/*
 * Symbols with a index into the string table of zero (n_un.n_strx == 0) are
 * defined to have a null, "", name.  Therefore all string indexes to non null
 * names must not have a zero string index.  This is bit historical information
 * that has never been well documented.
 */

This comment documents the original ld behavior. It has been known to happen that ld64 has changed the format of the Mach-O binary without updating the appropriate header files or man pages. This might be one of those situations.

Thank you for all the details! It's good to understand the reasoning, and also super interesting to learn about :)

I was also curious because we had to decide how to lay out the string table in LLD for Mach-O, but it seems emulating ld64's behavior might be the easiest thing to do (and D89639 will do so).