This is an archive of the discontinued LLVM Phabricator instance.

[llvm-rc] Allow dashes as part of resource name strings
ClosedPublic

Authored by mstorsjo on Jul 22 2021, 2:42 PM.

Details

Summary

This matches what MS rc.exe allows in practice. I'm not aware of
any legal syntax case that are broken by allowing dashes as part
of what the tokenizer considers an Identifier - but I'm not
very well versed in the RC syntax either, can @amccarth think of
any case that would be broken by this?

This fixes downstream bug
https://github.com/msys2/MINGW-packages/issues/9180.

Additionally, rc.exe allows such resource name strings to be surrounded
by quotes, ending up with e.g.

Resource name (string): "QUOTEDNAME"

(i.e., the quotes end up as part of the string), which llvm-rc doesn't
support yet either. (I'm not aware of such cases in the wild though,
but resource string names with dashes do exist.)

This also allows including files with unquoted paths, with filenames
containing dashes (which fixes
https://github.com/msys2/MINGW-packages/issues/9130, which has been
worked around differently so far).

Diff Detail

Event Timeline

mstorsjo requested review of this revision.Jul 22 2021, 2:42 PM
mstorsjo created this revision.
Herald added a project: Restricted Project. · View Herald TranscriptJul 22 2021, 2:42 PM

Cool. I'd never seen anyone use strings that don't look like a C or C++ identifier as a resource identifier, but it should work given that FindResource just uses the string as an opaque key.

If we maximally munch everywhere, we could consume an expression as a single identifier. For example:

#define WIDTH 400
#define HEIGHT 300
#define MARGIN 8

MyDlg DIALOGBOXEX 10, 10, WIDTH, HEIGHT
BEGIN
  LTEXT "Text in dialog", 4, 4, WIDTH-MARGIN, HEIGHT-MARGIN
END

In this context, WIDTH-MARGIN better be recognized as three tokens representing an expression rather than as a single (undefined) identifier. I know there's code out there like that.

Also, we'd better not allow these types of identifiers to be defined, i.e., don't allow #define FOO-BAR 42.

If you add those examples as test cases and they do the right thing, I'd be happy with this change.

My guess is that RC.EXE is implemented as a recursive descent parser that doesn't really have a uniform concept of "tokens." When it's expecting a "numeric or string identifier," it applies one set of munching rules, but when it's expecting an expression, it applies a different set. I haven't looked closely enough to see whether llvm-rc has that kind of flexibility at tokenization.

Cool. I'd never seen anyone use strings that don't look like a C or C++ identifier as a resource identifier, but it should work given that FindResource just uses the string as an opaque key.

Yep. This seems to be used by some project in the form ice40/chipdb-384.bin RCDATA "path/to/ice40/chipdb-384.bin" (see https://github.com/msys2/MINGW-packages/issues/9180). (Also some projects include filenames without quotes, e.g. 40 RCDATA unquoted-filename-with-dashes.ico as one identifier.)

If we maximally munch everywhere, we could consume an expression as a single identifier. For example:

#define WIDTH 400
#define HEIGHT 300
#define MARGIN 8

MyDlg DIALOGBOXEX 10, 10, WIDTH, HEIGHT
BEGIN
  LTEXT "Text in dialog", 4, 4, WIDTH-MARGIN, HEIGHT-MARGIN
END

In this context, WIDTH-MARGIN better be recognized as three tokens representing an expression rather than as a single (undefined) identifier. I know there's code out there like that.

Yes, but in this case, WIDTH-MARGIN would be expanded into 400-8 by the preprocessor, so llvm-rc's own parser would never see that expression in that form, only in the numeric form.

So if it's only possible at the preprocessor level to have expressions with named constants, but not on the RC level itself, then I think this change should be safe.

Also, we'd better not allow these types of identifiers to be defined, i.e., don't allow #define FOO-BAR 42.

If you add those examples as test cases and they do the right thing, I'd be happy with this change.

Unfortunately, the main llvm-rc tests are on output from the preprocessor, as clang isn't available as a dependency for anything under the llvm repo. The converse is possible, we can have a test under clang that invokes llvm-rc including preprocessing. We have some very small tests of that, invoking llvm-rc just to see that the preprocessing works though. I'm a bit unsure whether we should add more llvm-rc specific testcases in clang, as the preprocessor output in this case simply is 400-8.

My guess is that RC.EXE is implemented as a recursive descent parser that doesn't really have a uniform concept of "tokens." When it's expecting a "numeric or string identifier," it applies one set of munching rules, but when it's expecting an expression, it applies a different set.

Yup, I get that feeling too.

I haven't looked closely enough to see whether llvm-rc has that kind of flexibility at tokenization.

Not sure; llvm-rc first tokenizes the whole input, and then reads the feed of tokens. There's some different functions that expect different kinds of tokens depending on the context though, but e.g. for this case, I much prefer a single Identifier token to include all of this, instead of having a higher level logic that might try to fuse multiple tokens under some specific circumstances.

amccarth accepted this revision.Jul 23 2021, 10:03 AM

I see. I didn't realize llvm-rc used an actual separate preprocessor step. (RC has a crude one built-in that doesn't quite match any of the standards.)

Given external preprocessing, I can't think of any other problems that allowing - in identifiers would introduce, so this LGTM.

This revision is now accepted and ready to land.Jul 23 2021, 10:03 AM

Yep. This seems to be used by some project in the form ice40/chipdb-384.bin RCDATA "path/to/ice40/chipdb-384.bin" (see https://github.com/msys2/MINGW-packages/issues/9180). (Also some projects include filenames without quotes, e.g. 40 RCDATA unquoted-filename-with-dashes.ico as one identifier.)

It's even worse than that because there is no whitespace/newlines between entries: ice40/chipdb-384.bin RCDATA "T:/mingw-w64-nextpnr/src/nextpnr/ice40/chipdb/chipdb-384.bin"ice40/chipdb-1k.bin RCDATA "T:/mingw-w64-nextpnr/src/nextpnr/ice40/chipdb/chipdb-1k.bin"ice40/.... So it must be auto-generated, no sane human would maintain a file like that.

I see. I didn't realize llvm-rc used an actual separate preprocessor step. (RC has a crude one built-in that doesn't quite match any of the standards.)

Yup, I’ve also observed that the rc.exe built in one is a bit special. Llvm-rc didn’t use to have preprocessing at all before (you had to construct your own tool that did preprocessing manually beforehand, then calling llvm-rc), but as of recently, it can call clang as a subprocess.

Yep. This seems to be used by some project in the form ice40/chipdb-384.bin RCDATA "path/to/ice40/chipdb-384.bin" (see https://github.com/msys2/MINGW-packages/issues/9180). (Also some projects include filenames without quotes, e.g. 40 RCDATA unquoted-filename-with-dashes.ico as one identifier.)

It's even worse than that because there is no whitespace/newlines between entries: ice40/chipdb-384.bin RCDATA "T:/mingw-w64-nextpnr/src/nextpnr/ice40/chipdb/chipdb-384.bin"ice40/chipdb-1k.bin RCDATA "T:/mingw-w64-nextpnr/src/nextpnr/ice40/chipdb/chipdb-1k.bin"ice40/.... So it must be auto-generated, no sane human would maintain a file like that.

Oh, that’s absolutely terrible. Based on the error messages, I think the cmake files generate it. Regardless of this change, I think it’d be good to upstream a fix to them to at least add newlines inbetween.

This revision was landed with ongoing or failed builds.Jul 23 2021, 1:06 PM
This revision was automatically updated to reflect the committed changes.