When --format=binary is used with an input file name that contains one or more non-ASCII characters, LLD has undefined behaviour (it crashes on my Windows debug build) when it calls isalnum with those non-ASCII characters. Instead of calling isalnum, this patch checks against an explicit set of characters directly.
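Roughly, the problem and the direction of the fix look like this (a minimal sketch, not the actual patch; the real change is in ELF/InputFiles.cpp, and the helper names here are made up for illustration):

```cpp
#include <cctype>
#include <string>

// Undefined behaviour: on platforms where char is signed, any non-ASCII byte
// has a negative value, and passing a negative value (other than EOF) to
// std::isalnum is UB. MSVC's debug CRT asserts on it, which is the crash
// described above.
bool isAlnumUnsafe(char C) { return std::isalnum(C) != 0; }

// Safe alternative: compare against an explicit set of characters, which is
// also independent of the current locale.
bool isAlnumSafe(char C) {
  return ('a' <= C && C <= 'z') || ('A' <= C && C <= 'Z') ||
         ('0' <= C && C <= '9');
}

// Illustrative use: when a binary input file name is turned into a symbol
// name, every byte that is not alphanumeric is replaced with an underscore.
std::string mangleFileName(std::string S) {
  for (char &C : S)
    if (!isAlnumSafe(C))
      C = '_';
  return S;
}
```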
test/ELF/format-binary-non-ascii.s:5
I don't think this test is portable, because handling of non-ASCII characters in command-line arguments can vary depending on the system. I believe this file itself is encoded in UTF-8, but some systems may decide to convert it to UTF-16, or simply reject creating such a file. There's probably no easy way to write a portable test, so I'd omit the test. That's unfortunate, though.
I'm pretty sure embedded NUL is not allowed in a filename on any platform. That might be one way to write a portable test, if you can find a way to get one into the string here.
> if you can find a way to get one into the string here

That, I believe, is the problem. I don't think there's a way to pass a NUL character as a filename.
Maybe by using a unicode filename whose utf-8 encoding has an embedded zero somewhere? Given that we're just processing each character as a byte instead of as a proper unicode character, that should work, no?
Ahh wait, the problem is about negative values. Still, I think the same kind of test could be used. Find a unicode character whose encoding contains a byte > 128.
> Find a unicode character whose encoding contains a byte > 128

It's basically any non-ASCII character. But is it portable? I mean, for example, if the Windows CRT converts a command-line argument into UTF-16, this test will fail due to the difference in the number of underscores.
It seems unlikely the CRT is going to convert UTF-8 to UTF-16. More likely, depending on how lit issues the command, it'll interpret the UTF-8 bytes as though they're in the user's code page. For the U.S., this will likely be Windows-1252. The British pound sign in UTF-8 is 0xC2 0xA3. If you interpret those bytes in Windows-1252, you'll see "Â£", which I guess lld will convert to two underscores. On a non-Windows system, it'll still be two non-alphanumeric bytes, so I think the test should be fine.
Not all users have set their language to US English, and depending on the language/encoding settings, the pound sign could be interpreted differently and possibly converted to some other character (which might change the size of the string, or could even cause a "bad character" error?), no?
We had a discussion with Adrian and Reid: the test should work in any locale, and the pound sign will be passed to lld's main() as-is (as a two-byte UTF-8 sequence), because lld does not use wmain and we don't do anything fancy with encoding conversions, unlike clang. So I think you can submit this test.
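To make the byte-level argument above concrete: the pound sign is the two UTF-8 bytes 0xC2 0xA3, and since the name is processed byte by byte, each of those bytes independently fails the alphanumeric check and becomes an underscore, regardless of locale. A small self-contained sketch (the mangling loop here is illustrative, not lld's actual code):

```cpp
#include <cstdio>
#include <string>

int main() {
  // "lib£.so": the pound sign is the two UTF-8 bytes 0xC2 0xA3.
  std::string Name = "lib\xC2\xA3.so";
  for (char &C : Name) {
    bool Alnum = ('a' <= C && C <= 'z') || ('A' <= C && C <= 'Z') ||
                 ('0' <= C && C <= '9');
    if (!Alnum)
      C = '_'; // 0xC2, 0xA3 and '.' each become '_'
  }
  std::printf("%s\n", Name.c_str()); // prints "lib___so"
  return 0;
}
```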
ELF/InputFiles.cpp:934
isalnum might be locale-aware, so it is probably not safe to use here in the first place. Doing it manually (i.e. 'a' <= S[I] <= 'z' || 'A' <= S[I] <= 'Z' || '0' <= S[I] <= '9') seems better.
ELF/InputFiles.cpp:934
isalnum is locale-aware, but as far as I can see, we never change the locale away from the default "C" locale, so do we need to worry about it?
ELF/InputFiles.cpp:934
Not all people use their OSes with an English/US UI, and I think that if the system's default locale is not "C", isalpha could behave differently depending on the definition of "alphabet" in that system locale.
ELF/InputFiles.cpp:934
By default, the system locale is unimportant. According to the C99 standard (I don't have a copy of the later C standard), the locale at program start-up is always "C". We do not call setlocale anywhere in LLVM or LLD, so we are always in the "C" locale at the time this is called. In the "C" locale, the characters for which isalnum is true are exactly [A-Za-z0-9]. So I think the question should be: are we ever going to run LLD in a locale other than "C"?
ELF/InputFiles.cpp:934
LLVM or lld may be used as a library, so if the main program sets a locale, it affects our code.
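A small illustration of that concern (hedged: the locale name below is platform-specific and may not be installed, in which case setlocale fails and the behaviour doesn't change):

```cpp
#include <cctype>
#include <clocale>
#include <cstdio>

int main() {
  // 0xC9 is 'É' in ISO-8859-1. In the default "C" locale it is not
  // alphanumeric, but if a host application embedding lld as a library has
  // switched the locale, std::isalnum may start accepting it.
  unsigned char C = 0xC9; // passing an unsigned char value is well-defined
  std::printf("\"C\" locale:  %d\n", std::isalnum(C) != 0);
  if (std::setlocale(LC_ALL, "en_US.ISO8859-1")) // hypothetical locale name
    std::printf("host locale: %d\n", std::isalnum(C) != 0);
  return 0;
}
```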
ELF/InputFiles.cpp:934
That's a fair point. I'll make that change.
I discovered an existing "isAlnum" function in LLD's Strings.cpp, so I have exposed it and used it instead. It does have a slight difference in behaviour from std::isalnum, because it returns true for underscore characters, but that is harmless in our case. I might consider renaming the function to make it clear that it includes the underscore, maybe to isValidCIdentifierChar. What do you think?
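For reference, the helper in question looks roughly like this (quoted from memory, so the exact spelling in lld/ELF/Strings.cpp may differ); note the extra '_' case, which is the behavioural difference mentioned above:

```cpp
// Returns true for characters that are valid in a C identifier:
// [A-Za-z0-9_]. Locale-independent, and safe for any char value.
bool isAlnum(char C) {
  return C == '_' || ('a' <= C && C <= 'z') || ('A' <= C && C <= 'Z') ||
         ('0' <= C && C <= '9');
}
```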