This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
llvm/
-
test/tools/llvm-rc/
-
tools/
-
llvm-rc/
-
Inputs/
-
resname-string.rc
-
tokens.rc
-
resname-string.test
-
tokenizer.test
-
tools/llvm-rc/
-
llvm-rc/
-
ResourceScriptToken.cpp

Differential D106598

[llvm-rc] Allow dashes as part of resource name strings
ClosedPublic

Authored by mstorsjo on Jul 22 2021, 2:42 PM.

Download Raw Diff

Details

Reviewers

amccarth

Commits

rG0a1683f8cc0d: [llvm-rc] Allow dashes as part of resource name strings

Summary

This matches what MS rc.exe allows in practice. I'm not aware of
any legal syntax case that are broken by allowing dashes as part
of what the tokenizer considers an Identifier - but I'm not
very well versed in the RC syntax either, can @amccarth think of
any case that would be broken by this?

This fixes downstream bug
https://github.com/msys2/MINGW-packages/issues/9180.

Additionally, rc.exe allows such resource name strings to be surrounded
by quotes, ending up with e.g.

Resource name (string): "QUOTEDNAME"

(i.e., the quotes end up as part of the string), which llvm-rc doesn't
support yet either. (I'm not aware of such cases in the wild though,
but resource string names with dashes do exist.)

This also allows including files with unquoted paths, with filenames
containing dashes (which fixes
https://github.com/msys2/MINGW-packages/issues/9130, which has been
worked around differently so far).

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

mstorsjo requested review of this revision.Jul 22 2021, 2:42 PM

mstorsjo created this revision.

Herald added a project: Restricted Project. · View Herald TranscriptJul 22 2021, 2:42 PM

Harbormaster completed remote builds in B115699: Diff 360982.Jul 22 2021, 5:12 PM

Cool. I'd never seen anyone use strings that don't look like a C or C++ identifier as a resource identifier, but it should work given that FindResource just uses the string as an opaque key.

If we maximally munch everywhere, we could consume an expression as a single identifier. For example:

#define WIDTH 400
#define HEIGHT 300
#define MARGIN 8

MyDlg DIALOGBOXEX 10, 10, WIDTH, HEIGHT
BEGIN
  LTEXT "Text in dialog", 4, 4, WIDTH-MARGIN, HEIGHT-MARGIN
END

In this context, WIDTH-MARGIN better be recognized as three tokens representing an expression rather than as a single (undefined) identifier. I know there's code out there like that.

Also, we'd better not allow these types of identifiers to be defined, i.e., don't allow #define FOO-BAR 42.

If you add those examples as test cases and they do the right thing, I'd be happy with this change.

My guess is that RC.EXE is implemented as a recursive descent parser that doesn't really have a uniform concept of "tokens." When it's expecting a "numeric or string identifier," it applies one set of munching rules, but when it's expecting an expression, it applies a different set. I haven't looked closely enough to see whether llvm-rc has that kind of flexibility at tokenization.

In D106598#2900614, @amccarth wrote:

Cool. I'd never seen anyone use strings that don't look like a C or C++ identifier as a resource identifier, but it should work given that FindResource just uses the string as an opaque key.

Yep. This seems to be used by some project in the form ice40/chipdb-384.bin RCDATA "path/to/ice40/chipdb-384.bin" (see https://github.com/msys2/MINGW-packages/issues/9180). (Also some projects include filenames without quotes, e.g. 40 RCDATA unquoted-filename-with-dashes.ico as one identifier.)

If we maximally munch everywhere, we could consume an expression as a single identifier. For example:
#define WIDTH 400
#define HEIGHT 300
#define MARGIN 8

MyDlg DIALOGBOXEX 10, 10, WIDTH, HEIGHT
BEGIN
  LTEXT "Text in dialog", 4, 4, WIDTH-MARGIN, HEIGHT-MARGIN
END
In this context, WIDTH-MARGIN better be recognized as three tokens representing an expression rather than as a single (undefined) identifier. I know there's code out there like that.

Yes, but in this case, WIDTH-MARGIN would be expanded into 400-8 by the preprocessor, so llvm-rc's own parser would never see that expression in that form, only in the numeric form.

So if it's only possible at the preprocessor level to have expressions with named constants, but not on the RC level itself, then I think this change should be safe.

Also, we'd better not allow these types of identifiers to be defined, i.e., don't allow #define FOO-BAR 42.

If you add those examples as test cases and they do the right thing, I'd be happy with this change.

Unfortunately, the main llvm-rc tests are on output from the preprocessor, as clang isn't available as a dependency for anything under the llvm repo. The converse is possible, we can have a test under clang that invokes llvm-rc including preprocessing. We have some very small tests of that, invoking llvm-rc just to see that the preprocessing works though. I'm a bit unsure whether we should add more llvm-rc specific testcases in clang, as the preprocessor output in this case simply is 400-8.

My guess is that RC.EXE is implemented as a recursive descent parser that doesn't really have a uniform concept of "tokens." When it's expecting a "numeric or string identifier," it applies one set of munching rules, but when it's expecting an expression, it applies a different set.

Yup, I get that feeling too.

I haven't looked closely enough to see whether llvm-rc has that kind of flexibility at tokenization.

Not sure; llvm-rc first tokenizes the whole input, and then reads the feed of tokens. There's some different functions that expect different kinds of tokens depending on the context though, but e.g. for this case, I much prefer a single Identifier token to include all of this, instead of having a higher level logic that might try to fuse multiple tokens under some specific circumstances.

I see. I didn't realize llvm-rc used an actual separate preprocessor step. (RC has a crude one built-in that doesn't quite match any of the standards.)

Given external preprocessing, I can't think of any other problems that allowing - in identifiers would introduce, so this LGTM.

This revision is now accepted and ready to land.Jul 23 2021, 10:03 AM

In D106598#2900656, @mstorsjo wrote:

Yep. This seems to be used by some project in the form ice40/chipdb-384.bin RCDATA "path/to/ice40/chipdb-384.bin" (see https://github.com/msys2/MINGW-packages/issues/9180). (Also some projects include filenames without quotes, e.g. 40 RCDATA unquoted-filename-with-dashes.ico as one identifier.)

It's even worse than that because there is no whitespace/newlines between entries: ice40/chipdb-384.bin RCDATA "T:/mingw-w64-nextpnr/src/nextpnr/ice40/chipdb/chipdb-384.bin"ice40/chipdb-1k.bin RCDATA "T:/mingw-w64-nextpnr/src/nextpnr/ice40/chipdb/chipdb-1k.bin"ice40/.... So it must be auto-generated, no sane human would maintain a file like that.

In D106598#2900730, @amccarth wrote:

I see. I didn't realize llvm-rc used an actual separate preprocessor step. (RC has a crude one built-in that doesn't quite match any of the standards.)

Yup, I’ve also observed that the rc.exe built in one is a bit special. Llvm-rc didn’t use to have preprocessing at all before (you had to construct your own tool that did preprocessing manually beforehand, then calling llvm-rc), but as of recently, it can call clang as a subprocess.

In D106598#2900772, @jeremyd2019 wrote:

In D106598#2900656, @mstorsjo wrote:

Yep. This seems to be used by some project in the form ice40/chipdb-384.bin RCDATA "path/to/ice40/chipdb-384.bin" (see https://github.com/msys2/MINGW-packages/issues/9180). (Also some projects include filenames without quotes, e.g. 40 RCDATA unquoted-filename-with-dashes.ico as one identifier.)

It's even worse than that because there is no whitespace/newlines between entries: ice40/chipdb-384.bin RCDATA "T:/mingw-w64-nextpnr/src/nextpnr/ice40/chipdb/chipdb-384.bin"ice40/chipdb-1k.bin RCDATA "T:/mingw-w64-nextpnr/src/nextpnr/ice40/chipdb/chipdb-1k.bin"ice40/.... So it must be auto-generated, no sane human would maintain a file like that.

Oh, that’s absolutely terrible. Based on the error messages, I think the cmake files generate it. Regardless of this change, I think it’d be good to upstream a fix to them to at least add newlines inbetween.

This revision was landed with ongoing or failed builds.Jul 23 2021, 1:06 PM

Closed by commit rG0a1683f8cc0d: [llvm-rc] Allow dashes as part of resource name strings (authored by mstorsjo). · Explain Why

This revision was automatically updated to reflect the committed changes.

mstorsjo added a commit: rG0a1683f8cc0d: [llvm-rc] Allow dashes as part of resource name strings.

Revision Contents

Path

Size

llvm/

test/

tools/

llvm-rc/

Inputs/

2 lines

1 line

5 lines

1 line

tools/

llvm-rc/

ResourceScriptToken.cpp

2 lines

Diff 361313

llvm/test/tools/llvm-rc/Inputs/resname-string.rc

This file was added.

				stringname RCDATA { "foo" }
				name-with-dashes/and/slashes RCDATA { "foo" }

llvm/test/tools/llvm-rc/Inputs/tokens.rc

	1 + 2 - 3214L & 0x120894 032173 2\|&~+(-7){0xabcdef 0xABCDEFl} Begin End			1 + 2 - 3214L & 0x120894 032173 2\|&~+(-7){0xabcdef 0xABCDEFl} Begin End
	He11o LLVM			He11o LLVM
				identifier-with-dashes

	"RC string test.",L"Another RC string test.'&{",42,100			"RC string test.",L"Another RC string test.'&{",42,100

	Block Comment Ident /block / // comment */ ifier			Block Comment Ident /block / // comment */ ifier

	Line Comment // Identifier /*			Line Comment // Identifier /*

	/* Multi line			/* Multi line
	block			block
	comment */			comment */

	Multiple /* comments / on / a */ single // line			Multiple /* comments / on / a */ single // line

	":))"			":))"

llvm/test/tools/llvm-rc/resname-string.test

This file was added.

				; RUN: llvm-rc -no-preprocess /FO %t.res -- %p/Inputs/resname-string.rc
				; RUN: llvm-readobj %t.res \| FileCheck %s

				; CHECK: Resource name (string): STRINGNAME
				; CHECK: Resource name (string): NAME-WITH-DASHES/AND/SLASHES

llvm/test/tools/llvm-rc/tokenizer.test

	Show All 21 Lines
	; CHECK-NEXT: BlockBegin: {			; CHECK-NEXT: BlockBegin: {
	; CHECK-NEXT: Int: 0xabcdef; int value = 11259375			; CHECK-NEXT: Int: 0xabcdef; int value = 11259375
	; CHECK-NEXT: Int: 0xABCDEFl; int value = 11259375			; CHECK-NEXT: Int: 0xABCDEFl; int value = 11259375
	; CHECK-NEXT: BlockEnd: }			; CHECK-NEXT: BlockEnd: }
	; CHECK-NEXT: BlockBegin: Begin			; CHECK-NEXT: BlockBegin: Begin
	; CHECK-NEXT: BlockEnd: End			; CHECK-NEXT: BlockEnd: End
	; CHECK-NEXT: Identifier: He11o			; CHECK-NEXT: Identifier: He11o
	; CHECK-NEXT: Identifier: LLVM			; CHECK-NEXT: Identifier: LLVM
				; CHECK-NEXT: Identifier: identifier-with-dashes
	; CHECK-NEXT: String: "RC string test."			; CHECK-NEXT: String: "RC string test."
	; CHECK-NEXT: Comma: ,			; CHECK-NEXT: Comma: ,
	; CHECK-NEXT: String: L"Another RC string test.'&{"			; CHECK-NEXT: String: L"Another RC string test.'&{"
	; CHECK-NEXT: Comma: ,			; CHECK-NEXT: Comma: ,
	; CHECK-NEXT: Int: 42; int value = 42			; CHECK-NEXT: Int: 42; int value = 42
	; CHECK-NEXT: Comma: ,			; CHECK-NEXT: Comma: ,
	; CHECK-NEXT: Int: 100; int value = 100			; CHECK-NEXT: Int: 100; int value = 100
	; CHECK-NEXT: Identifier: Block			; CHECK-NEXT: Identifier: Block
	Show All 9 Lines

llvm/tools/llvm-rc/ResourceScriptToken.cpp

Show First 20 Lines • Show All 282 Lines • ▼ Show 20 Lines	bool Tokenizer::canStartIdentifier() const {
const char CurChar = Data[Pos];		const char CurChar = Data[Pos];
return std::isalpha(CurChar) \|\| CurChar == '_' \|\| CurChar == '.';		return std::isalpha(CurChar) \|\| CurChar == '_' \|\| CurChar == '.';
}		}

bool Tokenizer::canContinueIdentifier() const {		bool Tokenizer::canContinueIdentifier() const {
assert(!streamEof());		assert(!streamEof());
const char CurChar = Data[Pos];		const char CurChar = Data[Pos];
return std::isalnum(CurChar) \|\| CurChar == '_' \|\| CurChar == '.' \|\|		return std::isalnum(CurChar) \|\| CurChar == '_' \|\| CurChar == '.' \|\|
CurChar == '/' \|\| CurChar == '\\';		CurChar == '/' \|\| CurChar == '\\' \|\| CurChar == '-';
}		}

bool Tokenizer::canStartInt() const {		bool Tokenizer::canStartInt() const {
assert(!streamEof());		assert(!streamEof());
return std::isdigit(Data[Pos]);		return std::isdigit(Data[Pos]);
}		}

bool Tokenizer::canStartBlockComment() const {		bool Tokenizer::canStartBlockComment() const {
▲ Show 20 Lines • Show All 68 Lines • Show Last 20 Lines