[ELF] Prevent crash with binary inputs with non-ascii file names
ClosedPublic

Authored by jhenderson on Aug 31 2017, 5:56 AM.

Details

Summary

If using --format=binary with an input file name that contains one or more non-ASCII characters, LLD has undefined behaviour (it crashes on my Windows Debug build) when calling isalnum with these non-ASCII characters. Instead of calling isalnum, this patch checks against an explicit set of characters directly.

Diff Detail

Repository
rL LLVM
jhenderson created this revision.Aug 31 2017, 5:56 AM
ruiu added inline comments.Aug 31 2017, 10:39 AM
test/ELF/format-binary-non-ascii.s
4 ↗(On Diff #113391)

I don't think this test is portable, because handling of non-ASCII characters in command line arguments can vary depending on the system. I believe this file itself is encoded in UTF-8, but some systems may decide to convert it to UTF-16, or simply reject creating such a file.

There's probably no easy way to write a portable test, so I'd omit a test. That's unfortunate though.

I'm pretty sure embedded NUL is not allowed in a filename on any platform. That might be one way to write a portable test, if you can find a way to get one into the string here.

ruiu added a comment.Aug 31 2017, 10:49 AM

if you can find a way to get one into the string here

That, I believe, is the problem. I don't think there's a way to pass a NUL character as a filename.

Maybe by using a Unicode filename whose UTF-8 encoding has an embedded zero somewhere? Given that we're just processing each character as a byte instead of as a proper Unicode character, that should work, no?

Ah, wait, the problem is about negative values. Still, I think the same kind of test could be used. Find a Unicode character whose encoding contains a byte > 128.

ruiu added a comment.Aug 31 2017, 11:49 AM

Find a unicode character whose encoding contains a byte > 128

It's basically any non-ASCII character. But is it portable? I mean, for example, if the Windows CRT converts a command line argument into the UTF-16 encoding, this test will fail due to a difference in the number of underscores.

In D37331#858023, @ruiu wrote:

Find a unicode character whose encoding contains a byte > 128

It's basically any non-ASCII character. But is it portable? I mean, for example, if the Windows CRT converts a command line argument into the UTF-16 encoding, this test will fail due to a difference in the number of underscores.

It seems unlikely the CRT is going to convert UTF-8 to UTF-16. More likely, depending on how lit issues the command, is that it will interpret the UTF-8 bytes as though they're in the user's code page. For the U.S., this will likely be Windows-1252. The British pound sign in UTF-8 is 0xC2 0xA3. If you interpret those bytes in Windows-1252, you'll see Â£, which I guess lld will convert to two underscores. On a non-Windows system, it will still be two non-alphanumeric bytes, so I think the test should be fine.

ruiu added a comment.Aug 31 2017, 1:27 PM

Not all users have set their language to US English, and depending on the language/encoding settings, the pound sign could be interpreted and possibly converted to some other character (which might change the size of the string, or could even cause a "bad character" error), no?

ruiu added a comment.Aug 31 2017, 1:54 PM

We had a discussion with Adrian and Reid, and the test should work in any locale: the pound sign will be passed to lld's main() as-is (as a two-byte UTF-8 character), because lld does not use wmain, and we don't do anything fancy with encoding conversions, unlike clang. So I think you can submit this test.

ELF/InputFiles.cpp
934 ↗(On Diff #113391)

isalnum might be locale-aware, so it is probably not safe to use here in the first place. Doing it manually (i.e. 'a' <= S[I] <= 'z' || 'A' <= S[I] <= 'Z' || '0' <= S[I] <= '9') seems better.

jhenderson added inline comments.Sep 1 2017, 9:02 AM
ELF/InputFiles.cpp
934 ↗(On Diff #113391)

isalnum is locale-aware, but as far as I can see, we never change the locale away from the default "C" locale, so do we need to worry about it?

ruiu added inline comments.Sep 1 2017, 12:14 PM
ELF/InputFiles.cpp
934 ↗(On Diff #113391)

Not all people are using their OSes with an English/US UI, and I think that if the system's default locale is not "C", isalnum could behave differently depending on the definition of "alphanumeric" in that system locale.

jhenderson added inline comments.Sep 4 2017, 2:40 AM
ELF/InputFiles.cpp
934 ↗(On Diff #113391)

By default, the system locale is unimportant. According to the C99 standard (I don't have a copy of the later C standard), the locale at program start-up is always "C". We do not call setlocale anywhere in LLVM or LLD, so we are always in the "C" locale at the time this is called. In the "C" locale, the characters for which isalnum returns true are exactly [A-Za-z0-9].

So, I think the question should be: are we ever going to set LLD to run in a locale other than the "C" locale?

grimar added a subscriber: grimar.Sep 4 2017, 6:08 AM
ruiu added inline comments.Sep 5 2017, 3:45 PM
ELF/InputFiles.cpp
934 ↗(On Diff #113391)

LLVM or lld may be used as a library, so if a main program sets a locale, it affects our code.

jhenderson added inline comments.Sep 6 2017, 4:44 AM
ELF/InputFiles.cpp
934 ↗(On Diff #113391)

That's a fair point. I'll make that change.

jhenderson updated this revision to Diff 114001.Sep 6 2017, 6:32 AM
jhenderson edited the summary of this revision. (Show Details)

I discovered an existing "isAlnum" function in LLD's Strings.cpp, so I have exposed it and used it instead. Its behaviour differs slightly from std::isalnum, because it returns true for underscore characters, but that is harmless in our case. I might consider renaming the function to make it clear that it includes the underscore, maybe to isValidCIdentifierChar. What do you think?

ruiu accepted this revision.Sep 6 2017, 10:18 AM

LGTM. Thanks!

This revision is now accepted and ready to land.Sep 6 2017, 10:18 AM
This revision was automatically updated to reflect the committed changes.