This is an archive of the discontinued LLVM Phabricator instance.

Frgen7a2 retitled this revision from "lld patch for Linker Script UTF-8 encoding capability" to "lld patch for Linker Script UTF-8 BOM encoding capability". Nov 8 2019, 8:34 AM

Frgen7a2 edited the summary of this revision.

Out of curiosity what workflow leads to BOMs in linker scripts?

We happen to have BOMs in the linker scripts for our custom platform because they were authored in Windows (where we cross-compile from).

This doesn't seem very necessary. GNU ld rejects BOM as well. I think you can achieve cross compilability by simply removing BOM.

% xxd a.x
00000000: efbb bf45 4e54 5259 285f 6c61 6265 6c29  ...ENTRY(_label)
00000010: 0a
% ld.bfd a.o -T a.x -o a
ld.bfd:a.x:1: ignoring invalid character `\357' in expression
ld.bfd:a.x:1: ignoring invalid character `\273' in expression
ld.bfd:a.x:1: ignoring invalid character `\277' in expression

This revision now requires changes to proceed. Nov 8 2019, 10:21 AM

ld ignores the BOM, but still links. lld mistakes it for part of the next token and generates an error.

A BOM is perfectly valid at the start of a UTF-8 file (though not very useful, granted). Why not support it? Other tools under the LLVM umbrella do, e.g. clang.
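The failure mode described above can be sketched in a few lines. This is not lld's actual lexer: `firstToken` is a hypothetical helper, written against `std::string_view` instead of `llvm::StringRef` so that it is self-contained, and it assumes a token ends at whitespace or at simple punctuation.

```cpp
#include <string_view>

// Hypothetical helper for illustration only (not part of lld).
// Returns the leading keyword-like token: the bytes from the first
// non-whitespace byte up to the next whitespace or punctuation byte.
// If the input begins with punctuation, the result is empty.
constexpr std::string_view firstToken(std::string_view s) {
  constexpr std::string_view whitespace = " \t\r\n";
  constexpr std::string_view delimiters = " \t\r\n(){};,";
  const auto start = s.find_first_not_of(whitespace);
  if (start == std::string_view::npos)
    return {};
  s.remove_prefix(start);
  // The BOM bytes EF BB BF are neither whitespace nor punctuation,
  // so they stay attached to whatever follows them.
  return s.substr(0, s.find_first_of(delimiters));
}
```

For the `a.x` contents shown in the `xxd` dump, a tokenizer like this yields an 8-byte first token (the three BOM bytes followed by `ENTRY`), which does not compare equal to the keyword `ENTRY`. Note that the BOM has to be written as a separate literal, `"\xEF\xBB\xBF" "ENTRY(_label)\n"`, because `E` is itself a hex digit and would otherwise be absorbed into the `\xBF` escape.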

In D70011#1739164, @cameron314 wrote:

> ld ignores the BOM, but still links. lld mistakes it for part of the next token and generates an error.
>
> A BOM is perfectly valid at the start of a UTF-8 file (though not very useful, granted). Why not support it? Other tools under the LLVM umbrella do, e.g. clang.

We need a good reason to support it, not the other way around. ld emitting warnings is a pretty clear signal that BOM is not a good idea in a linker script.

I disagree, ld emitting invalid character warnings is simply a good indication that it was never tested with such an input :-)

If the community thinks this bug isn't important enough to warrant the two-line fix, then so be it, I won't push the matter. No hard feelings.
This was merely an attempt to fix another instance of a common Windows-incompatibility bug that we happened to come across with a real-world linker script.

This needs a test case, and it's a good idea to upload patches with more context (use -U99999 if generating a patch via git to upload to the web interface, or just use arcanist).

In D70011#1739193, @cameron314 wrote:

> I disagree, ld emitting invalid character warnings is simply a good indication that it was never tested with such an input :-)

... I actually think this is a good indication that your tests have never been verified. If you append --fatal-warnings (like the compiler driver option -Werror), ld.bfd will return 1.

> If the community thinks this bug isn't important enough to warrant the two-line fix, then so be it, I won't push the matter. No hard feelings.
> This was merely an attempt to fix another instance of a common Windows-incompatibility bug that we happened to come across with a real-world linker script.

From your arguments I still don't see the reasoning for why this should be supported ;-) In some cases lld is more rigorous, and that often identifies brittle constructs. For this BOM case it may be hard to argue that the construct is brittle, but given the GNU ld warning I really can't say it should be supported.

grimar added a subscriber: grimar. Nov 11 2019, 12:37 AM

The thought "UTF-8 BOM is not useful" may just be ingrained in my mind. I probably just don't know enough about Unicode to say "this is going to be useful in lld". Perhaps raise an issue on https://sourceware.org/ and see what the GNU ld maintainers say?

> Other tools under the LLVM umbrella do, e.g. clang

Please also keep in mind that clang doesn't support options such as -finput-charset and -fexec-charset. Its UTF-8 support is just enough to keep people from feeling sad.

mamai added a subscriber: mamai. Mar 25 2021, 11:51 AM

Revision Contents

Path: lld/ELF/ScriptLexer.cpp
Size: 2 lines
Diff 228463

lld/ELF/ScriptLexer.cpp

Context not available.
  std::vector<StringRef> vec;
  mbs.push_back(mb);
  StringRef s = mb.getBuffer();
+ if (s.startswith("\xEF\xBB\xBF"))
+   s = s.substr(3);
  StringRef begin = s;

  for (;;) {
Context not available.
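
The two added lines can be exercised in isolation with a standard-library sketch. This is an approximation under stated assumptions, not the patch itself: `stripUtf8Bom` is a hypothetical name, and `std::string_view` stands in for `llvm::StringRef` so the snippet builds without LLVM headers.

```cpp
#include <string_view>

// Hypothetical stand-in for the two-line change in the diff above.
// Drops a leading UTF-8 byte order mark (bytes EF BB BF) if present
// and returns the rest of the buffer; other input is returned as is.
constexpr std::string_view stripUtf8Bom(std::string_view s) {
  constexpr std::string_view bom = "\xEF\xBB\xBF";
  // substr clamps the length, so buffers shorter than 3 bytes are safe.
  if (s.substr(0, bom.size()) == bom)
    s.remove_prefix(bom.size());
  return s;
}
```

Only an exact three-byte prefix is removed: a buffer without a BOM, or one holding a truncated BOM such as `EF BB`, comes back unchanged, so a check of this shape does not alter scripts that already lex correctly. A regression test for the real patch, as requested in the review, would presumably feed the lexer a script that starts with these three bytes.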

lld patch for Linker Script UTF-8 BOM encoding capability (Needs Revision, Public)