When --format=binary is used with an input file name that contains one or more non-ASCII characters, LLD has undefined behaviour (it crashes on my Windows debug build) when it calls isalnum with those non-ASCII characters. Instead of calling isalnum, this patch checks against an explicit set of characters directly.
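Roughly, the problem and the direction of the fix look like this (a minimal sketch, not the actual patch; the real change is in ELF/InputFiles.cpp, and the helper names here are made up for illustration):

```cpp
#include <cctype>
#include <string>

// Undefined behaviour: on platforms where char is signed, any non-ASCII byte
// has a negative value, and passing a negative value (other than EOF) to
// std::isalnum is UB. MSVC's debug CRT asserts on it, which is the crash
// described above.
bool isAlnumUnsafe(char C) { return std::isalnum(C) != 0; }

// Safe alternative: compare against an explicit set of characters, which is
// also independent of the current locale.
bool isAlnumSafe(char C) {
  return ('a' <= C && C <= 'z') || ('A' <= C && C <= 'Z') ||
         ('0' <= C && C <= '9');
}

// Illustrative use: when a binary input file name is turned into a symbol
// name, every byte that is not alphanumeric is replaced with an underscore.
std::string mangleFileName(std::string S) {
  for (char &C : S)
    if (!isAlnumSafe(C))
      C = '_';
  return S;
}
```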
test/ELF/format-binary-non-ascii.s:5
I don't think this test is portable, because handling of non-ASCII characters in command-line arguments can vary depending on the system. I believe this file itself is encoded in UTF-8, but some systems may decide to convert it to UTF-16, or simply reject creating such a file. There's probably no easy way to write a portable test, so I'd omit the test. That's unfortunate, though.
I'm pretty sure embedded NUL is not allowed in a filename on any platform. That might be one way to write a portable test, if you can find a way to get one into the string here.
> if you can find a way to get one into the string here

That, I believe, is the problem. I don't think there's a way to pass a NUL character as a filename.
Maybe by using a unicode filename whose utf-8 encoding has an embedded zero somewhere? Given that we're just processing each character as a byte instead of as a proper unicode character, that should work, no?
Ahh wait, the problem is about negative values. Still, I think the same kind of test could be used. Find a unicode character whose encoding contains a byte > 128.
> Find a unicode character whose encoding contains a byte > 128

It's basically any non-ASCII character. But is it portable? I mean, for example, if the Windows CRT converts a command-line argument into UTF-16, this test will fail due to the difference in the number of underscores.
It seems unlikely the CRT is going to convert UTF-8 to UTF-16. More likely, depending on how lit issues the command, it'll interpret the UTF-8 bytes as though they're in the user's code page. For the U.S., this will likely be Windows-1252. The British pound sign in UTF-8 is 0xC2 0xA3. If you interpret those bytes in Windows-1252, you'll see "Â£", which I guess lld will convert to two underscores. On a non-Windows system, it'll still be two non-alphanumeric bytes, so I think the test should be fine.
Not all users have set their language to US English, and depending on the language/encoding settings, the pound sign could be interpreted differently and possibly converted to some other character (which might change the size of the string, or could even cause a "bad character" error?), no?
We had a discussion with Adrian and Reid: the test should work in any locale, and the pound sign will be passed to lld's main() as-is (as a two-byte UTF-8 sequence), because lld does not use wmain and we don't do anything fancy with encoding conversions, unlike clang. So I think you can submit this test.
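To make the byte-level argument above concrete: the pound sign is the two UTF-8 bytes 0xC2 0xA3, and since the name is processed byte by byte, each of those bytes independently fails the alphanumeric check and becomes an underscore, regardless of locale. A small self-contained sketch (the mangling loop here is illustrative, not lld's actual code):

```cpp
#include <cstdio>
#include <string>

int main() {
  // "lib£.so": the pound sign is the two UTF-8 bytes 0xC2 0xA3.
  std::string Name = "lib\xC2\xA3.so";
  for (char &C : Name) {
    bool Alnum = ('a' <= C && C <= 'z') || ('A' <= C && C <= 'Z') ||
                 ('0' <= C && C <= '9');
    if (!Alnum)
      C = '_'; // 0xC2, 0xA3 and '.' each become '_'
  }
  std::printf("%s\n", Name.c_str()); // prints "lib___so"
  return 0;
}
```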
ELF/InputFiles.cpp:934
isalnum might be locale-aware, so it is probably not safe to use here in the first place. Doing it manually (i.e. 'a' <= S[I] <= 'z' || 'A' <= S[I] <= 'Z' || '0' <= S[I] <= '9') seems better.
ELF/InputFiles.cpp:934
isalnum is locale-aware, but as far as I can see, we never change the locale away from the default "C" locale, so do we need to worry about it?
ELF/InputFiles.cpp:934
Not all people use their OSes with an English/US UI, and I think that if the system's default locale is not "C", isalpha could behave differently depending on the definition of "alphabet" in that system locale.
ELF/InputFiles.cpp:934
By default, the system locale is unimportant. According to the C99 standard (I don't have a copy of the later C standard), the locale at program start-up is always "C". We do not call setlocale anywhere in LLVM or LLD, so we are always in the "C" locale at the time this is called. In the "C" locale, the characters for which isalnum is true are exactly [A-Za-z0-9]. So I think the question should be: are we ever going to run LLD in a locale other than "C"?
ELF/InputFiles.cpp:934
LLVM or lld may be used as a library, so if the main program sets a locale, it affects our code.
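A small illustration of that concern (hedged: the locale name below is platform-specific and may not be installed, in which case setlocale fails and the behaviour doesn't change):

```cpp
#include <cctype>
#include <clocale>
#include <cstdio>

int main() {
  // 0xC9 is 'É' in ISO-8859-1. In the default "C" locale it is not
  // alphanumeric, but if a host application embedding lld as a library has
  // switched the locale, std::isalnum may start accepting it.
  unsigned char C = 0xC9; // passing an unsigned char value is well-defined
  std::printf("\"C\" locale:  %d\n", std::isalnum(C) != 0);
  if (std::setlocale(LC_ALL, "en_US.ISO8859-1")) // hypothetical locale name
    std::printf("host locale: %d\n", std::isalnum(C) != 0);
  return 0;
}
```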
ELF/InputFiles.cpp:934
That's a fair point. I'll make that change.
I discovered an existing "isAlnum" function in LLD's Strings.cpp, so I have exposed it and used it instead. It does have a slight difference in behaviour from std::isalnum, because it returns true for underscore characters, but that is harmless in our case. I might consider renaming the function to make it clear that it includes the underscore, maybe to isValidCIdentifierChar. What do you think?
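For reference, the helper in question looks roughly like this (quoted from memory, so the exact spelling in lld/ELF/Strings.cpp may differ); note the extra '_' case, which is the behavioural difference mentioned above:

```cpp
// Returns true for characters that are valid in a C identifier:
// [A-Za-z0-9_]. Locale-independent, and safe for any char value.
bool isAlnum(char C) {
  return C == '_' || ('a' <= C && C <= 'z') || ('A' <= C && C <= 'Z') ||
         ('0' <= C && C <= '9');
}
```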