This is an archive of the discontinued LLVM Phabricator instance.

[AsmParser] make .ascii support spaces as separators
ClosedPublic

Authored by jcai19 on Nov 13 2020, 1:42 PM.

Details

Summary

Currently the integrated assembler only allows commas as the separator
between string arguments in .ascii. This patch adds support to using
space as separators and make IAS consistent with GNU assembler.

Link: https://github.com/ClangBuiltLinux/linux/issues/1196

Diff Detail

Event Timeline

jcai19 created this revision.Nov 13 2020, 1:42 PM
Herald added a project: Restricted Project. · View Herald TranscriptNov 13 2020, 1:42 PM
jcai19 requested review of this revision.Nov 13 2020, 1:42 PM
jcai19 added subscribers: manojgupta, llozano.

This needs tests, both to show it accepts multiple strings and to show for each of the three directives that it produces the right output (i.e. is equivalent to multiple individual directives).

llvm/lib/MC/MCParser/AsmParser.cpp
3025–3026
3028

Should be "returns" not "return" otherwise it means something very different, though I'm not sure this is a particularly useful comment. You could also consider inverting the return value of the parseOp lambda above to be more intuitive.

This needs tests, both to show it accepts multiple strings and to show for each of the three directives that it produces the right output (i.e. is equivalent to multiple individual directives).

Thanks for the reminder. I did some local testing but forgot to add tests and was about to leave a comment here that I would add them soon.

It appears that I misunderstood the existing code. It actually supports multiple string arguments, but only when commas are used between the arguments. GAS supports commas or spaces as the separator (or when both are used in the same call such as .ascii "aa", "bb" "cc ). Will refactor my patch and update the commit message.

It appears that I misunderstood the existing code. It actually supports multiple string arguments, but only when commas are used between the arguments. GAS supports commas or spaces as the separator (or when both are used in the same call such as .ascii "aa", "bb" "cc ). Will refactor my patch and update the commit message.

In which case I assume .asciz "foo" "bar" is equivalent to .asciz "foobar" rather than .asciz "foo\0bar" (same for .string), just like C string concatenation, and hence why both syntaxes exist?

jcai19 updated this revision to Diff 305297.Nov 13 2020, 8:57 PM

Update commit message and add a test case.

jcai19 added a comment.EditedNov 13 2020, 9:03 PM

In which case I assume .asciz "foo" "bar" is equivalent to .asciz "foobar" rather than .asciz "foo\0bar" (same for .string), just like C string concatenation, and hence why both syntaxes exist?

I think GNU assembler would produce the same disassembly with .asciz "foo" "bar" and .asciz "foo", "bar".

$ cat test.s 
.asciz "foo" "bar"
.asciz "foo", "bar"

$ arm-linux-gnueabihf-gcc test.s -c -o test.o
$ arm-linux-gnueabihf-objdump -dr test.o
gcc.o:     file format elf32-littlearm


Disassembly of section .text:

00000000 <.text>:
   0:	006f6f66 	.word	0x006f6f66
   4:	00726162 	.word	0x00726162
   8:	006f6f66 	.word	0x006f6f66
   c:	00726162 	.word	0x00726162

In which case I assume .asciz "foo" "bar" is equivalent to .asciz "foobar" rather than .asciz "foo\0bar" (same for .string), just like C string concatenation, and hence why both syntaxes exist?

I think GNU assembler would produce the same disassembly with .asciz "foo" "bar" and .asciz "foo", "bar".

$ cat test.s 
.asciz "foo" "bar"
.asciz "foo", "bar"

$ arm-linux-gnueabihf-gcc test.s -c -o test.o
$ arm-linux-gnueabihf-objdump -dr test.o
gcc.o:     file format elf32-littlearm


Disassembly of section .text:

00000000 <.text>:
   0:	006f6f66 	.word	0x006f6f66
   4:	00726162 	.word	0x00726162
   8:	006f6f66 	.word	0x006f6f66
   c:	00726162 	.word	0x00726162

In which case that's confusing and likely to lead to bugs if people make use of the preprocessor in the hopes of concatenating strings but end up with NUL bytes being inserted contrary to what there would be in C. Can we not just fix the assembly to use commas rather than this weird syntax that seems to be a special case for .asciz? You can't write .word 2 2, only .word 2, 2, so why do string directives really need special treatment?

In which case that's confusing and likely to lead to bugs if people make use of the preprocessor in the hopes of concatenating strings but end up with NUL bytes being inserted contrary to what there would be in C. Can we not just fix the assembly to use commas rather than this weird syntax that seems to be a special case for .asciz? You can't write .word 2 2, only .word 2, 2, so why do string directives really need special treatment?

I agree only accepting comma as the separators would be clearer in newer code. On the other hand, adding the compatibility with GAS would allow IAS to support more legacy code base (although I'd admit I'm not sure how widely-used separating strings with space in these directives is). @nickdesaulniers Any thoughts and comments?

In which case I assume .asciz "foo" "bar" is equivalent to .asciz "foobar" rather than .asciz "foo\0bar" (same for .string), just like C string concatenation, and hence why both syntaxes exist?

I think GNU assembler would produce the same disassembly with .asciz "foo" "bar" and .asciz "foo", "bar".

$ cat test.s 
.asciz "foo" "bar"
.asciz "foo", "bar"

$ arm-linux-gnueabihf-gcc test.s -c -o test.o
$ arm-linux-gnueabihf-objdump -dr test.o
gcc.o:     file format elf32-littlearm


Disassembly of section .text:

00000000 <.text>:
   0:	006f6f66 	.word	0x006f6f66
   4:	00726162 	.word	0x00726162
   8:	006f6f66 	.word	0x006f6f66
   c:	00726162 	.word	0x00726162

https://github.com/ClangBuiltLinux/linux/blob/f01c30de86f1047e9bae1b1b1417b0ce8dcd15b1/arch/arm/probes/kprobes/test-core.h#L114-L116 makes it sounds like .ascii and .asciz work differently in this regard?

jcai19 added a comment.EditedNov 16 2020, 1:18 PM

https://github.com/ClangBuiltLinux/linux/blob/f01c30de86f1047e9bae1b1b1417b0ce8dcd15b1/arch/arm/probes/kprobes/test-core.h#L114-L116 makes it sounds like .ascii and .asciz work differently in this regard?

So I played with the two directives a little bit, and I think they both treat a call with multiple string arguments the same as multiple calls with one argument. For example, .ascii produces the same disassembly with "foo" "bar", "foo", "bar" or "foobar" sine the directive does not append anything after string, while .asciz appends \0 and turns "foo" "bar" or "foo", "bar" into "foo\0bar" .

With .ascii:

$ cat test_space.s 
.ascii "foo" "bar"
$ cat test_comma.s 
.ascii "foo", "bar"
$ cat test_concatenation.s 
.ascii "foobar"

$ gcc test_space.s -c -o test_space.o
$ objdump -dr test_space.o

test_space.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <.text>:
   0:	66 6f                	outsw  %ds:(%rsi),(%dx)
   2:	6f                   	outsl  %ds:(%rsi),(%dx)
   3:	62                   	.byte 0x62
   4:	61                   	(bad)  
   5:	72                   	.byte 0x72

$ gcc test_comma.s -c -o test_comma.o
$ objdump -dr test_comma.o

test_comma.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <.text>:
   0:	66 6f                	outsw  %ds:(%rsi),(%dx)
   2:	6f                   	outsl  %ds:(%rsi),(%dx)
   3:	62                   	.byte 0x62
   4:	61                   	(bad)  
   5:	72                   	.byte 0x72

$ gcc test_concatenation.s -c -o test_concatenation.o
$ objdump -dr test_concatenation.o

test_concatenation.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <.text>:
   0:	66 6f                	outsw  %ds:(%rsi),(%dx)
   2:	6f                   	outsl  %ds:(%rsi),(%dx)
   3:	62                   	.byte 0x62
   4:	61                   	(bad)  
   5:	72                   	.byte 0x72

   4:	Address 0x0000000000000004 is out of bounds.

With .asciz:

$ cat test_space.s 
.asciz "foo" "bar"
$ cat test_comma.s 
.asciz "foo", "bar"
$ cat test_concatenation.s 
.asciz "foo\0bar"

$ gcc test_space.s -c -o test_space.o
$ objdump -dr test_space.o

test_space.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <.text>:
   0:	66 6f                	outsw  %ds:(%rsi),(%dx)
   2:	6f                   	outsl  %ds:(%rsi),(%dx)
   3:	00 62 61             	add    %ah,0x61(%rdx)
   6:	72 00                	jb     0x8

$ gcc test_comma.s -c -o test_comma.o
$ objdump -dr test_comma.o

test_comma.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <.text>:
   0:	66 6f                	outsw  %ds:(%rsi),(%dx)
   2:	6f                   	outsl  %ds:(%rsi),(%dx)
   3:	00 62 61             	add    %ah,0x61(%rdx)
   6:	72 00                	jb     0x8


$ gcc test_concatenation.s -c -o test_concatenation.o
$ objdump -dr test_concatenation.o

Disassembly of section .text:

0000000000000000 <.text>:
   0:	66 6f                	outsw  %ds:(%rsi),(%dx)
   2:	6f                   	outsl  %ds:(%rsi),(%dx)
   3:	00 62 61             	add    %ah,0x61(%rdx)
   6:	72 00                	jb     0x8

https://github.com/ClangBuiltLinux/linux/blob/f01c30de86f1047e9bae1b1b1417b0ce8dcd15b1/arch/arm/probes/kprobes/test-core.h#L114-L116 makes it sounds like .ascii and .asciz work differently in this regard?

So I played with the two directives a little bit, and I think they both treat a call with multiple string arguments the same as multiple calls with one argument. For example, .ascii produces the same disassembly with "foo" "bar", "foo", "bar" or "foobar" sine the directive does not append anything after string, while .asciz appends \0 and turns "foo" "bar" or "foo", "bar" into "foo\0bar" .

This looks like confirmation of what I said.

With .ascii:

$ cat test_space.s 
.ascii "foo" "bar"
$ cat test_comma.s 
.ascii "foo", "bar"
$ cat test_concatenation.s 
.ascii "foobar"

$ gcc test_space.s -c -o test_space.o
$ objdump -dr test_space.o

You can use llvm-readelf --string-dump=.text test_space.o to interpret the .text section as C style strings, though that might not be helpful for the .ascii examples.

In which case I assume .asciz "foo" "bar" is equivalent to .asciz "foobar" rather than .asciz "foo\0bar" (same for .string), just like C string concatenation, and hence why both syntaxes exist?

I think GNU assembler would produce the same disassembly with .asciz "foo" "bar" and .asciz "foo", "bar".

$ cat test.s 
.asciz "foo" "bar"
.asciz "foo", "bar"

$ arm-linux-gnueabihf-gcc test.s -c -o test.o
$ arm-linux-gnueabihf-objdump -dr test.o
gcc.o:     file format elf32-littlearm


Disassembly of section .text:

00000000 <.text>:
   0:	006f6f66 	.word	0x006f6f66
   4:	00726162 	.word	0x00726162
   8:	006f6f66 	.word	0x006f6f66
   c:	00726162 	.word	0x00726162

In which case that's confusing and likely to lead to bugs if people make use of the preprocessor in the hopes of concatenating strings but end up with NUL bytes being inserted contrary to what there would be in C. Can we not just fix the assembly to use commas rather than this weird syntax that seems to be a special case for .asciz? You can't write .word 2 2, only .word 2, 2, so why do string directives really need special treatment?

I disagree. That's why .ascii is distinct from .asciz; if developers do not want NUL-terminated C style strings, they should use .ascii and not .asciz. The assembler need not match the behavior of the C preprocessor; this patch is about matching the behavior of GNU as such that clang can be used as a substitute. Not matching the behavior of GNU as precisely here would be a mistake that would hinder the adoption of clang for existing assembler sources.

@jcai19 this patch still needs tests for this new change of behavior to the parser. You have good test cases in your comments; those should be added into this patch itself.

rnk added a comment.Nov 16 2020, 2:44 PM

IMO matching C preprocessor behavior isn't enough of a compelling reason to diverge from gas's behavior here. Nick asked for more tests, and we could stand to test each directive with the new behavior. Code looks good to me.

jrtc27 added a comment.EditedNov 16 2020, 2:45 PM

In which case that's confusing and likely to lead to bugs if people make use of the preprocessor in the hopes of concatenating strings but end up with NUL bytes being inserted contrary to what there would be in C. Can we not just fix the assembly to use commas rather than this weird syntax that seems to be a special case for .asciz? You can't write .word 2 2, only .word 2, 2, so why do string directives really need special treatment?

I disagree. That's why .ascii is distinct from .asciz; if developers do not want NUL-terminated C style strings, they should use .ascii and not .asciz. The assembler need not match the behavior of the C preprocessor; this patch is about matching the behavior of GNU as such that clang can be used as a substitute. Not matching the behavior of GNU as precisely here would be a mistake that would hinder the adoption of clang for existing assembler sources.

Well, no, it is confusing for someone who doesn't know about the GNU syntax. If I see .asciz "foo" "bar" I'd assume it's equivalent to .asciz "foobar" not .asciz "foo"; .asciz "bar" and that .asciz is being used so there's a NUL on the end of the concatenated string.

FreeBSD has .asciz MACHINE_ARCH in one of its arm csu files (for an ELF note). Some architectures like to build up MACHINE_ARCH (defined in a header file) by concatenating multiple strings together for the different components. Currently arm doesn't do this (and only has two variants), but you can imagine other architectures might also want an ELF note and so would give confusing behaviour (you may not find it confusing, but I can guarantee you a large fraction of people would); if you want the C-like behaviour you have to use .ascii and add the NUL byte manually, as is done in your example.

But that's all justification for why I don't like GNU as's behaviour and why I don't like adding that feature to LLVM (I would not advocate for _different_ behaviour as that would cause even more problems, only to just not implement it at all). However, your example is a real-world case of something that really does need this feature and cannot easily be rewritten to use commas, so I see this as unavoidable and so yes, LLVM should implement this GNU as feature.

In which case that's confusing and likely to lead to bugs if people make use of the preprocessor in the hopes of concatenating strings but end up with NUL bytes being inserted contrary to what there would be in C. Can we not just fix the assembly to use commas rather than this weird syntax that seems to be a special case for .asciz? You can't write .word 2 2, only .word 2, 2, so why do string directives really need special treatment?

I disagree. That's why .ascii is distinct from .asciz; if developers do not want NUL-terminated C style strings, they should use .ascii and not .asciz. The assembler need not match the behavior of the C preprocessor; this patch is about matching the behavior of GNU as such that clang can be used as a substitute. Not matching the behavior of GNU as precisely here would be a mistake that would hinder the adoption of clang for existing assembler sources.

Well, no, it is confusing for someone who doesn't know about the GNU syntax. If I see .asciz "foo" "bar" I'd assume it's equivalent to .asciz "foobar" not .asciz "foo"; .asciz "bar" and that .asciz is being used so there's a NUL on the end of the concatenated string.

FreeBSD has .asciz MACHINE_ARCH in one of its arm csu files (for an ELF note). Some architectures like to build up MACHINE_ARCH (defined in a header file) by concatenating multiple strings together for the different components. Currently arm doesn't do this (and only has two variants), but you can imagine other architectures might also want an ELF note and so would give confusing behaviour (you may not find it confusing, but I can guarantee you a large fraction of people would); if you want the C-like behaviour you have to use .ascii and add the NUL byte manually, as is done in your example.

But that's all justification for why I don't like GNU as's behaviour and why I don't like adding that feature to LLVM (I would not advocate for _different_ behaviour as that would cause even more problems, only to just not implement it at all). However, your example is a real-world case of something that really does need this feature and cannot easily be rewritten to use commas, so I see this as unavoidable and so yes, LLVM should implement this GNU as feature.

I agree with @jrtc27 that juxtaposition with .asciz has a bug-prone behavior. This is unrelated to GNU as's preprocessing stage do_scrub_chars. The Linux kernel does not need .asciz, so if we cannot get a reasonable behavior for .asciz, we can simply not allow the usage.

jcai19 updated this revision to Diff 305875.Nov 17 2020, 12:15 PM

Thanks all for the comments. From what I read so far, we agree thatt there is a real need to implement this feature in LLVM so add more test cases based on the comments.

MaskRay added a comment.EditedNov 17 2020, 12:28 PM

I mentioned the .asciz issue to binutils.... The reaction was a bit quick .. https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=3d955acb36f483c05724181da5ffba46b1303c43

I think rejecting .asciz "a" "b" may be fine for us since there is no real world usage.

Then I am not sure we need a new method parseManyWithOptionalComma. It can be specialized.

I think rejecting .asciz "a" "b" may be fine for us since there is no real world usage.

Then I am not sure we need a new method parseManyWithOptionalComma. It can be specialized.

Does that mean we support space as separators in .ascii but not for .asciz and .string? Wouldn't that create more confusion?

llvm/test/MC/AsmParser/directive_ascii.s
74

sorry, is there an extra two 00 in the output?

96

sorry, is there an extra two 00 in the output?

bcain added a subscriber: bcain.Nov 19 2020, 12:40 PM
jcai19 added inline comments.Nov 19 2020, 3:50 PM
llvm/test/MC/AsmParser/directive_ascii.s
74

This is what llvm-mc prints. I think both \0 and \000 mean 0 in octal form.

MaskRay added inline comments.Nov 19 2020, 3:54 PM
llvm/test/MC/AsmParser/directive_ascii.s
50–57

The test may be a bit too verbose in this way.

One .ascii "A", "B" "C", "D" is sufficient for the coverage.

71

.asciz "A", "B" "C", "D" should produce A\0BC\0D\0 as GNU as 2.36 will do

jcai19 updated this revision to Diff 306562.Nov 19 2020, 5:07 PM

Simplify test cases as @MaskRay suggested.

jcai19 retitled this revision from [AsmParser] make .ascii/.asciz/.string support multiple strings to [AsmParser] make .ascii/.asciz/.string support spaces as the separator.Nov 19 2020, 6:07 PM
jcai19 edited the summary of this revision. (Show Details)
jcai19 retitled this revision from [AsmParser] make .ascii/.asciz/.string support spaces as the separator to [AsmParser] make .ascii/.asciz/.string support spaces as separators.
jrtc27 added inline comments.Nov 19 2020, 6:29 PM
llvm/test/MC/AsmParser/directive_ascii.s
56

This no longer matches binutils and is the specific part of this patch that has been contentious.

jcai19 added inline comments.Nov 20 2020, 12:25 PM
llvm/test/MC/AsmParser/directive_ascii.s
56

Do you mean the .asciz test case? I think this is what gnu AS does.

$ cat test1.s
.asciz "A", "B" "C"

$ gcc test1.s -c -o test1.o
llvm-readelf --string-dump=.text test1.o
String dump of section '.text':
[ 0] A
[ 2] B
[ 4] C

$ cat test2.s
asciz "A\0B\0C"

$ gcc test2.s -c -o test2.o
llvm-readelf --string-dump=.text test2.o
String dump of section '.text':
[ 0] A
[ 2] B
[ 4] C

$ cat test3.s
asciz "ABC"

$ gcc test3.s -c -o test3.o
llvm-readelf --string-dump=.text test3.o
String dump of section '.text':
[ 0] ABC

jrtc27 added inline comments.Nov 20 2020, 12:26 PM
llvm/test/MC/AsmParser/directive_ascii.s
56
jcai19 added inline comments.Nov 20 2020, 12:45 PM
llvm/test/MC/AsmParser/directive_ascii.s
56

Ah thanks I did miss the link. This seems like a big change and would potentially breaking existing code, @MaskRay do you have a link to the binutils bug report?

jcai19 updated this revision to Diff 308486.Nov 30 2020, 2:52 PM
jcai19 edited the summary of this revision. (Show Details)

Since GNU assembler changed how it interpreted using space as separators in .asciz and .string in https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=3d955acb36f483c05724181da5ffba46b1303c43, limit this patch to .ascii only as matching the new GAS behavior in the two directives may cause subtle issues when building existing code that assumes the old behavior with LLVM's integrated assembler.

jcai19 retitled this revision from [AsmParser] make .ascii/.asciz/.string support spaces as separators to [AsmParser] make .ascii support spaces as separators.Nov 30 2020, 2:53 PM
jcai19 edited the summary of this revision. (Show Details)
llvm/lib/MC/MCParser/AsmParser.cpp
3041–3042

Would a ternary expression make sense here?

if (ZeroTerminated ? parseMany(parseOp) : parseManyWithOptionalComma())
  return ...;
jrtc27 added inline comments.Nov 30 2020, 3:00 PM
llvm/lib/MC/MCParser/AsmParser.cpp
3005–3006

This grammar needs updating

3024–3025

It'd be nicer to push this all up into parseOp so it just parses all the strings it finds one at a time. For .ascii you can get away with treating spaces and commas the same, but if we ever want to support the new .asciz/.string behaviour (which I feel we should one day, provided binutils doesn't decide to revert its behaviour) we'll need it to be like that, so it'd make it trivial to enable. Also I feel it's a cleaner (and likely simpler) way to write it regardless.

jcai19 updated this revision to Diff 308492.Nov 30 2020, 3:42 PM
jcai19 edited the summary of this revision. (Show Details)

Address comments.

jcai19 added inline comments.Dec 3 2020, 4:24 PM
llvm/lib/MC/MCParser/AsmParser.cpp
3024–3025

Thanks for the comment. Just to clarify, do you suggest we merge the code of parseManyWithOptionalComma into parseOp? parseOp is used as an argument to parseMany later, so if we change it, then we may have to make adjustments to parseMany too, which is used in other places. Or maybe what you suggest is that we should completely rewrite parseDirectiveAscii function and not call parseMany at all? Also I am not sure about rewriting the code assuming the new GNU behaviors will be the way going forward. I could not find any discussion on the change and its potential implications, and therefore could not confirm if that would be what GNU community agrees on.

jrtc27 added inline comments.Dec 3 2020, 4:40 PM
llvm/lib/MC/MCParser/AsmParser.cpp
3024–3025

I mean:

  • remove parseManyWithOptionalComma
  • keep using parseMany(parseOp)
  • leave parseMany unchanged
  • make parseOp parse _all_ strings up to the first comma, effectively treating them like a single multi-token operand (i.e. keep calling parseEscapedString and emitting the string while there's still another AsmToken::String)
jcai19 added inline comments.Dec 3 2020, 5:32 PM
llvm/lib/MC/MCParser/AsmParser.cpp
3024–3025

Thanks for the clarification. IIUC, we only change parseOp to treat space-separated strings as one string until we hit a comma or end of statement, and keep everything else unchanged. Wouldn't that also affect .asciz and .string and make them behave like the new GNU behavior?

jrtc27 added inline comments.Dec 3 2020, 5:34 PM
llvm/lib/MC/MCParser/AsmParser.cpp
3024–3025

Yes, so for now you can add an additional check to bail out after the first iteration of the do-while loop if ZeroTerminated is true, and at some point in the future all that's needed is to delete that additional check to make .asciz and .string behave like the new GNU binutils.

jcai19 updated this revision to Diff 309658.Dec 4 2020, 2:23 PM

Address @jrtc27's comment. paseOp function now skips all the spaces between string arguments until it hits any comma or end of statement. For now we will limit the support to .ascii only.

jrtc27 added inline comments.Dec 4 2020, 2:35 PM
llvm/lib/MC/MCParser/AsmParser.cpp
3005–3006

Hm, this isn't particularly clear as it stands that multiple strings next to each other is valid, it just looks like the comma has been put it in brackets. Maybe this to better reflect how it's parsed (square brackets around the comma instead would also work, but doesn't capture the logical parse tree so well)? I don't know if the "string" needs to be put in parentheses though to apply + to it; is this a real BNF-esque grammar or just a hand-wavey one for human consumption?

jcai19 added inline comments.Dec 4 2020, 3:03 PM
llvm/lib/MC/MCParser/AsmParser.cpp
3005–3006

It appears to be only for human consumption, at least I could not find anywhere it was parsed. For .ascii we could use either space or comma to achieve the same effect, which made comma optional. Based on the original text, I assume brackets mean the enclosed symbol is mandatory while parentheses means it is optional.

jrtc27 added inline comments.Dec 4 2020, 3:09 PM
llvm/lib/MC/MCParser/AsmParser.cpp
3005–3006

No, parens are for grouping (so * applies to multiple items), and square brackets indicate the items are optional. https://en.wikipedia.org/wiki/Backus–Naur_form#Variants describes some of these common BNF-based grammars. In this case I think leaving the parens out of my suggestion is fine, nobody in their right mind would think the + means that it applies to the double quote (i.e. that, say, "string"" is valid), and strictly speaking it should really be '"' string '"' or similar if you want it to be a proper formal grammar. So unless you have objections I think my suggestion is the right thing to go with :)

jcai19 added inline comments.Dec 4 2020, 3:18 PM
llvm/lib/MC/MCParser/AsmParser.cpp
3005–3006

Okay I double checked the syntax of .ascii and you are right the string arguments are optional. Will update accordingly.

jcai19 updated this revision to Diff 309678.Dec 4 2020, 3:19 PM

Update the syntax in the comment.

jrtc27 added inline comments.Dec 4 2020, 3:22 PM
llvm/test/MC/AsmParser/directive_ascii.s
71

Hmm, MaskRay suggested a test case with , "D" on the end, that's probably a good idea.

jcai19 updated this revision to Diff 309687.Dec 4 2020, 3:54 PM

Updated a test case.

reames resigned from this revision.Dec 5 2020, 2:36 PM
jcai19 added a comment.EditedDec 17 2020, 1:37 PM

Just want to follow up and see if anyone might have any thoughts on the current implementation of this patch. Otherwise, is it good to submit the patch? @jrtc27 @rnk @nickdesaulniers @MaskRay

jrtc27 added inline comments.Dec 17 2020, 1:42 PM
llvm/test/MC/AsmParser/directive_ascii.s
52–56

I don't think I understand why these get printed with .byte but others with .ascii. I think it'd be important to make these use CHECK-NEXT (and CHECK-LABEL) though to assert that there aren't 0s sneaking in (which this test is poor at, the existing cases are not well-written).

jcai19 updated this revision to Diff 312615.Dec 17 2020, 2:11 PM

Update the test case.

jcai19 added inline comments.Dec 17 2020, 2:41 PM
llvm/test/MC/AsmParser/directive_ascii.s
52–56

< I don't think I understand why these get printed with .byte but others with .ascii

It seems llvm-mc prints ascii code for single bytes, as shown in other test cases. Or perhaps your question is about the implementation of llvm-mc?

< I think it'd be important to make these use CHECK-NEXT (and CHECK-LABEL) though to assert that there aren't 0s sneaking in (which this test is poor at, the existing cases are not well-written).

Updated. Thanks.

jrtc27 added inline comments.Dec 17 2020, 2:50 PM
llvm/test/MC/AsmParser/directive_ascii.s
52–56

Ok that makes sense. In an ideal world this would be .byte 65 ; .ascii "BC" ; .byte 68 then but that's purely cosmetic and not worth wasting time implementing (you'd have to concatenate to Data in the new parse loop and then emit it all at once at the end of the loop).

jrtc27 accepted this revision.Dec 17 2020, 2:50 PM

Looks good to me, let's see what the others think.

This revision is now accepted and ready to land.Dec 17 2020, 2:50 PM

Looks good to me, let's see what the others think.

Thanks. Will leave it open for another 24 hours for comments.

nickdesaulniers accepted this revision.Dec 17 2020, 4:14 PM

Thanks Jian! (I can get a little further assembling the Linux kernel's arch/arm/probes/kprobes/test-arm.o (allyesconfig) with Clang's integrated assembler with this; though that file now has additional failures unrelated to this patch.

This revision was landed with ongoing or failed builds.Dec 20 2020, 10:45 PM
This revision was automatically updated to reflect the committed changes.

Which GDB tests? How are they broken? I'm extremely surprised by the given this patch implements what binutils has done for a very long time (only .asciz/.string have changed behaviour in binutils, and those are not modified by this patch here so should not have regressions).

This seems like a premature revert and probably just causes churn to the LLVM upstream to me...The last revision looks good to me and resolves a GNU as/LLVM integrated assembler difference which we agreed the integrated assembler should improve in that case.

Upstream gdb developers don't build Clang. Many tests may be written in a GCC/binutils dependent way. The issue is not necessarily the integrated assembler issue. It can well be in another place we haven't noticed yet. If indeed that way, internally we can just disable the gdb tests. If there is a Clang 11 vs Clang 12 difference. We could make the upstream gdb tests more amenable to more compiler versions.

MaskRay reopened this revision.Jan 13 2021, 2:56 PM
This revision is now accepted and ready to land.Jan 13 2021, 2:56 PM

Which GDB tests? How are they broken? I'm extremely surprised by the given this patch implements what binutils has done for a very long time (only .asciz/.string have changed behaviour in binutils, and those are not modified by this patch here so should not have regressions).

The reason is unclear yet and may as well not be caused by this change as @MaskRay pointed out. Will reland or update the patch once we know the reason.

MaskRay added inline comments.Jan 13 2021, 9:53 PM
llvm/lib/MC/MCParser/AsmParser.cpp
3013

A complete sentence needs a period.

jcai19 closed this revision.Jan 14 2021, 5:55 PM
jcai19 marked an inline comment as done.

The test failure turned out to be a false positive. Relanded the change on https://github.com/llvm/llvm-project/commit/9dfeec853008109b1cbe926c22675c96226040d9