This is an archive of the discontinued LLVM Phabricator instance.

[llvm-rc] Add rudimentary support for codepages
ClosedPublic

Authored by mstorsjo on Apr 29 2018, 12:49 PM.

Details

Summary

Only support UTF-8 (since LLVM contains UTF-8 parsing support already, and the code even does that already) and Windows-1252 (where most code points has the same value in unicode). Keep the existing default as only allowing ASCII input.

Using the option type JoinedOrSeparate, since the real rc.exe handles options in this form, even if llvm-rc uses Separate for other similar existing options.

Rename the struct SearchParams to WriterParams since it's now used for more than just include paths.

Add a missing getResourceTypeName method to the BundleResource class, to fix error printing from within STRINGTABLE resources (used in tests).

Diff Detail

Event Timeline

mstorsjo created this revision.Apr 29 2018, 12:49 PM
amccarth added inline comments.
tools/llvm-rc/ResourceFileWriter.cpp
156

s/all/most/

208

s/all/most/

tools/llvm-rc/ResourceFileWriter.h
29

Are you concerned about the name clash with the macro defined in the Windows headers? It seems like code that wants to include this header better make sure that it hasn't already included Windows headers (or that it has undefined the macros).

33

That's not _strictly_ true. Windows-1252 maps many characters in the range 0x80-0x9F that do not correspond to the "upper" control characters Unicode has at the corresponding code points. For example, Windows-1252 0x83 corresponds to Unicode U+0192 (LATIN SMALL LETTER F WITH HOOK).

It's probably true enough for practical purposes, but I'd be disappointed if this comment leads readers to not appreciate the differences.

Also, s/8 bit values/8-bit values/.

mstorsjo added inline comments.Apr 30 2018, 11:36 AM
tools/llvm-rc/ResourceFileWriter.h
29

Oh, indeed, I hadn't thought about that. I'll test how it behaves if that header is included.

33

Oh, thanks for pointing this out!

Is there any other codepage I should rather pick for the same purpose - trivial implementation in converting to unicode without external dependencies? Win-28591 aka latin1?

amccarth added inline comments.Apr 30 2018, 11:53 AM
tools/llvm-rc/ResourceFileWriter.h
33

Yes, the Latin-1 page would work as you intended and require no conversion. That said, it's probably pretty uncommon in practice.

mstorsjo updated this revision to Diff 144623.Apr 30 2018, 1:21 PM
mstorsjo edited the summary of this revision. (Show Details)

Added a small mapping table to convert the full range of Win-1252 to unicode, renamed the enum constants to avoid potential clashes with windows headers.

Looks good to me now, but I'm not an official reviewer.

It would be nice if the CP-1252 test exercised one of the unusual characters that you had to build the map for. Since the sources for the tests are stored as binary files, I'm having a hard time peeking into it from Phabricator, so maybe you did, but it's not obvious.

Looks good to me now, but I'm not an official reviewer.

It would be nice if the CP-1252 test exercised one of the unusual characters that you had to build the map for. Since the sources for the tests are stored as binary files, I'm having a hard time peeking into it from Phabricator, so maybe you did, but it's not obvious.

Yes, I used one of those chars. The test rc file for cp1252 is the following:

1 "åäö © ƒ \xe5\xe4\366 \251 \x83"
2 L"åäö © ƒ \xe5\xe4\366 \251 \x0192"

Both produce the same UTF16 string; the first half of the strings is a bunch of literal >= 0x80 values, the other half is producing the same using escape codes. The escape codes in the narrow string are interpreted according to the code page, while escape codes in the wide string map directly to their unicode values. The output of this is four times "E5 00 E4 00 F6 00 20 00 A9 00 20 00 92 01".

amccarth accepted this revision.May 2 2018, 11:42 AM

LGTM.

This revision is now accepted and ready to land.May 2 2018, 11:42 AM
This revision was automatically updated to reflect the committed changes.