This is an archive of the discontinued LLVM Phabricator instance.

lld patch for Linker Script UTF-8 BOM encoding capability
Needs RevisionPublic

Authored by Frgen7a2 on Nov 8 2019, 8:30 AM.

Details

Summary

Allows to open and successfully parse Linker Scripts in UTF-8 BOM encoding.

Diff Detail

Repository
rLLD LLVM Linker

Event Timeline

Frgen7a2 created this revision.Nov 8 2019, 8:30 AM
Frgen7a2 retitled this revision from lld patch for Linker Script UTF-8 encoding capability to lld patch for Linker Script UTF-8 BOM encoding capability .Nov 8 2019, 8:34 AM
Frgen7a2 edited the summary of this revision. (Show Details)
emaste added a comment.Nov 8 2019, 9:05 AM

Out of curiosity what workflow leads to BOMs in linker scripts?

We happen to have BOMs in the linker scripts for our custom platform because they were authored in Windows (where we cross-compile from).

MaskRay requested changes to this revision.Nov 8 2019, 10:21 AM

This doesn't seem very necessary. GNU ld rejects BOM as well. I think you can achieve cross compilability by simply removing BOM.

% xxd a.x
00000000: efbb bf45 4e54 5259 285f 6c61 6265 6c29  ...ENTRY(_label)
00000010: 0a
% ld.bfd a.o -T a.x -o a
ld.bfd:a.x:1: ignoring invalid character `\357' in expression
ld.bfd:a.x:1: ignoring invalid character `\273' in expression
ld.bfd:a.x:1: ignoring invalid character `\277' in expression
This revision now requires changes to proceed.Nov 8 2019, 10:21 AM

ld ignores the BOM, but still links. lld mistakes it for part of the next token and generates an error.

A BOM is perfectly valid at the start of a UTF-8 file (though not very useful, granted). Why not support it? Other tools under the LLVM umbrella do, e.g. clang.

ld ignores the BOM, but still links. lld mistakes it for part of the next token and generates an error.

A BOM is perfectly valid at the start of a UTF-8 file (though not very useful, granted). Why not support it? Other tools under the LLVM umbrella do, e.g. clang.

We need a good reason to support it, not the other way around. ld emitting warnings is a pretty clear signal that BOM is not a good idea in a linker script.

I disagree, ld emitting invalid character warnings is simply a good indication that it was never tested with such an input :-)

If the community thinks this bug isn't important enough to warrant the two-line fix, then so be it, I won't push the matter. No hard feelings.
This was merely an attempt to fix another instance of a common Windows-incompatibility bug that we happened to come across with a real-world linker script.

This needs a test case, and it's a good idea to upload patches with more context (use -U99999 if generating a patch via git to upload to the web interface, or just use arcanist).

MaskRay added a comment.EditedNov 8 2019, 12:22 PM

I disagree, ld emitting invalid character warnings is simply a good indication that it was never tested with such an input :-)

... I actually think this is a good indication that your tests have never been verified. If you append --fatal-warnings (like the compiler driver option -Werror), ld.bfd will return 1.

If the community thinks this bug isn't important enough to warrant the two-line fix, then so be it, I won't push the matter. No hard feelings.
This was merely an attempt to fix another instance of a common Windows-incompatibility bug that we happened to come across with a real-world linker script.

From your arguments I never see the reasoning that this should be supported ;-) In some cases lld is more rigorous - that oftentimes identifies brittle constructs. For this BOM case it may be hard to argue it is brittle, but from the GNU ld warning I really can't say it should be supported.

grimar added a subscriber: grimar.Nov 11 2019, 12:37 AM
MaskRay added a comment.EditedNov 11 2019, 11:32 AM

The thought "UTF-8 BOM is not useful" may just be ingrained in my mind. I probably just don't know enough of Unicode to simply say "this is going to be useful in lld". Probably raise an issue on https://sourceware.org/ and see what the GNU ld maintainers say?

Other tools under the LLVM umbrella do, e.g. clang

Please also keep in mind that clang doesn't support options such as -finput-charset -fexec-charset. Its UTF-8 support is just enough that does not make people feel sad.

mamai added a subscriber: mamai.Mar 25 2021, 11:51 AM