This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
llvm/
-
lib/MC/MCParser/
-
MC/
-
MCParser/
5/14
AsmParser.cpp
-
test/MC/AsmParser/
-
MC/
-
AsmParser/
4/13
directive_ascii.s

Differential D91460

[AsmParser] make .ascii support spaces as separators
ClosedPublic

Authored by jcai19 on Nov 13 2020, 1:42 PM.

Download Raw Diff

Details

Reviewers

nickdesaulniers
MaskRay
peter.smith
ostannard
efriedma
rnk
craig.topper
jrtc27
reames

Commits

rGe0963ae274be: [AsmParser] make .ascii support spaces as separators

Summary

Currently the integrated assembler only allows commas as the separator
between string arguments in .ascii. This patch adds support to using
space as separators and make IAS consistent with GNU assembler.

Link: https://github.com/ClangBuiltLinux/linux/issues/1196

Diff Detail

Repository: rG LLVM Github Monorepo

Unit TestsFailed

	Time	Test
	430 ms	x64 debian > HWAddressSanitizer-x86_64.TestCases::sizes.cpp

Event Timeline

jcai19 created this revision.Nov 13 2020, 1:42 PM

Herald added a project: Restricted Project. · View Herald TranscriptNov 13 2020, 1:42 PM

Herald added subscribers: llvm-commits, atanasyan, jrtc27 and 2 others. · View Herald Transcript

jcai19 requested review of this revision.Nov 13 2020, 1:42 PM

jcai19 added a reviewer: nickdesaulniers.Nov 13 2020, 1:42 PM

jcai19 added subscribers: manojgupta, llozano.

This needs tests, both to show it accepts multiple strings and to show for each of the three directives that it produces the right output (i.e. is equivalent to multiple individual directives).

llvm/lib/MC/MCParser/AsmParser.cpp
3025–3026
3028	Should be "returns" not "return" otherwise it means something very different, though I'm not sure this is a particularly useful comment. You could also consider inverting the return value of the parseOp lambda above to be more intuitive.

jcai19 added reviewers: MaskRay, peter.smith, ostannard, efriedma, rnk, craig.topper, reames.Nov 13 2020, 1:48 PM

In D91460#2395030, @jrtc27 wrote:

This needs tests, both to show it accepts multiple strings and to show for each of the three directives that it produces the right output (i.e. is equivalent to multiple individual directives).

Thanks for the reminder. I did some local testing but forgot to add tests and was about to leave a comment here that I would add them soon.

Harbormaster completed remote builds in B78814: Diff 305254.Nov 13 2020, 2:14 PM

It appears that I misunderstood the existing code. It actually supports multiple string arguments, but only when commas are used between the arguments. GAS supports commas or spaces as the separator (or when both are used in the same call such as .ascii "aa", "bb" "cc ). Will refactor my patch and update the commit message.

In D91460#2395236, @jcai19 wrote:

It appears that I misunderstood the existing code. It actually supports multiple string arguments, but only when commas are used between the arguments. GAS supports commas or spaces as the separator (or when both are used in the same call such as .ascii "aa", "bb" "cc ). Will refactor my patch and update the commit message.

In which case I assume .asciz "foo" "bar" is equivalent to .asciz "foobar" rather than .asciz "foo\0bar" (same for .string), just like C string concatenation, and hence why both syntaxes exist?

Update commit message and add a test case.

In which case I assume .asciz "foo" "bar" is equivalent to .asciz "foobar" rather than .asciz "foo\0bar" (same for .string), just like C string concatenation, and hence why both syntaxes exist?

I think GNU assembler would produce the same disassembly with .asciz "foo" "bar" and .asciz "foo", "bar".

$ cat test.s 
.asciz "foo" "bar"
.asciz "foo", "bar"

$ arm-linux-gnueabihf-gcc test.s -c -o test.o
$ arm-linux-gnueabihf-objdump -dr test.o
gcc.o:     file format elf32-littlearm


Disassembly of section .text:

00000000 <.text>:
   0:	006f6f66 	.word	0x006f6f66
   4:	00726162 	.word	0x00726162
   8:	006f6f66 	.word	0x006f6f66
   c:	00726162 	.word	0x00726162

In D91460#2395521, @jcai19 wrote:
In which case I assume .asciz "foo" "bar" is equivalent to .asciz "foobar" rather than .asciz "foo\0bar" (same for .string), just like C string concatenation, and hence why both syntaxes exist?

I think GNU assembler would produce the same disassembly with .asciz "foo" "bar" and .asciz "foo", "bar".
$ cat test.s 
.asciz "foo" "bar"
.asciz "foo", "bar"

$ arm-linux-gnueabihf-gcc test.s -c -o test.o
$ arm-linux-gnueabihf-objdump -dr test.o
gcc.o:     file format elf32-littlearm


Disassembly of section .text:

00000000 <.text>:
   0:	006f6f66 	.word	0x006f6f66
   4:	00726162 	.word	0x00726162
   8:	006f6f66 	.word	0x006f6f66
   c:	00726162 	.word	0x00726162

In which case that's confusing and likely to lead to bugs if people make use of the preprocessor in the hopes of concatenating strings but end up with NUL bytes being inserted contrary to what there would be in C. Can we not just fix the assembly to use commas rather than this weird syntax that seems to be a special case for .asciz? You can't write .word 2 2, only .word 2, 2, so why do string directives really need special treatment?

In which case that's confusing and likely to lead to bugs if people make use of the preprocessor in the hopes of concatenating strings but end up with NUL bytes being inserted contrary to what there would be in C. Can we not just fix the assembly to use commas rather than this weird syntax that seems to be a special case for .asciz? You can't write .word 2 2, only .word 2, 2, so why do string directives really need special treatment?

I agree only accepting comma as the separators would be clearer in newer code. On the other hand, adding the compatibility with GAS would allow IAS to support more legacy code base (although I'd admit I'm not sure how widely-used separating strings with space in these directives is). @nickdesaulniers Any thoughts and comments?

Harbormaster completed remote builds in B78842: Diff 305297.Nov 13 2020, 9:30 PM

In D91460#2395521, @jcai19 wrote:
In which case I assume .asciz "foo" "bar" is equivalent to .asciz "foobar" rather than .asciz "foo\0bar" (same for .string), just like C string concatenation, and hence why both syntaxes exist?

I think GNU assembler would produce the same disassembly with .asciz "foo" "bar" and .asciz "foo", "bar".
$ cat test.s 
.asciz "foo" "bar"
.asciz "foo", "bar"

$ arm-linux-gnueabihf-gcc test.s -c -o test.o
$ arm-linux-gnueabihf-objdump -dr test.o
gcc.o:     file format elf32-littlearm


Disassembly of section .text:

00000000 <.text>:
   0:	006f6f66 	.word	0x006f6f66
   4:	00726162 	.word	0x00726162
   8:	006f6f66 	.word	0x006f6f66
   c:	00726162 	.word	0x00726162

https://github.com/ClangBuiltLinux/linux/blob/f01c30de86f1047e9bae1b1b1417b0ce8dcd15b1/arch/arm/probes/kprobes/test-core.h#L114-L116 makes it sounds like .ascii and .asciz work differently in this regard?

https://github.com/ClangBuiltLinux/linux/blob/f01c30de86f1047e9bae1b1b1417b0ce8dcd15b1/arch/arm/probes/kprobes/test-core.h#L114-L116 makes it sounds like .ascii and .asciz work differently in this regard?

So I played with the two directives a little bit, and I think they both treat a call with multiple string arguments the same as multiple calls with one argument. For example, .ascii produces the same disassembly with "foo" "bar", "foo", "bar" or "foobar" sine the directive does not append anything after string, while .asciz appends \0 and turns "foo" "bar" or "foo", "bar" into "foo\0bar" .

With .ascii:

$ cat test_space.s 
.ascii "foo" "bar"
$ cat test_comma.s 
.ascii "foo", "bar"
$ cat test_concatenation.s 
.ascii "foobar"

$ gcc test_space.s -c -o test_space.o
$ objdump -dr test_space.o

test_space.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <.text>:
   0:	66 6f                	outsw  %ds:(%rsi),(%dx)
   2:	6f                   	outsl  %ds:(%rsi),(%dx)
   3:	62                   	.byte 0x62
   4:	61                   	(bad)  
   5:	72                   	.byte 0x72

$ gcc test_comma.s -c -o test_comma.o
$ objdump -dr test_comma.o

test_comma.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <.text>:
   0:	66 6f                	outsw  %ds:(%rsi),(%dx)
   2:	6f                   	outsl  %ds:(%rsi),(%dx)
   3:	62                   	.byte 0x62
   4:	61                   	(bad)  
   5:	72                   	.byte 0x72

$ gcc test_concatenation.s -c -o test_concatenation.o
$ objdump -dr test_concatenation.o

test_concatenation.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <.text>:
   0:	66 6f                	outsw  %ds:(%rsi),(%dx)
   2:	6f                   	outsl  %ds:(%rsi),(%dx)
   3:	62                   	.byte 0x62
   4:	61                   	(bad)  
   5:	72                   	.byte 0x72

   4:	Address 0x0000000000000004 is out of bounds.

With .asciz:

$ cat test_space.s 
.asciz "foo" "bar"
$ cat test_comma.s 
.asciz "foo", "bar"
$ cat test_concatenation.s 
.asciz "foo\0bar"

$ gcc test_space.s -c -o test_space.o
$ objdump -dr test_space.o

test_space.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <.text>:
   0:	66 6f                	outsw  %ds:(%rsi),(%dx)
   2:	6f                   	outsl  %ds:(%rsi),(%dx)
   3:	00 62 61             	add    %ah,0x61(%rdx)
   6:	72 00                	jb     0x8

$ gcc test_comma.s -c -o test_comma.o
$ objdump -dr test_comma.o

test_comma.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <.text>:
   0:	66 6f                	outsw  %ds:(%rsi),(%dx)
   2:	6f                   	outsl  %ds:(%rsi),(%dx)
   3:	00 62 61             	add    %ah,0x61(%rdx)
   6:	72 00                	jb     0x8


$ gcc test_concatenation.s -c -o test_concatenation.o
$ objdump -dr test_concatenation.o

Disassembly of section .text:

0000000000000000 <.text>:
   0:	66 6f                	outsw  %ds:(%rsi),(%dx)
   2:	6f                   	outsl  %ds:(%rsi),(%dx)
   3:	00 62 61             	add    %ah,0x61(%rdx)
   6:	72 00                	jb     0x8

In D91460#2398062, @jcai19 wrote:

https://github.com/ClangBuiltLinux/linux/blob/f01c30de86f1047e9bae1b1b1417b0ce8dcd15b1/arch/arm/probes/kprobes/test-core.h#L114-L116 makes it sounds like .ascii and .asciz work differently in this regard?

So I played with the two directives a little bit, and I think they both treat a call with multiple string arguments the same as multiple calls with one argument. For example, .ascii produces the same disassembly with "foo" "bar", "foo", "bar" or "foobar" sine the directive does not append anything after string, while .asciz appends \0 and turns "foo" "bar" or "foo", "bar" into "foo\0bar" .

This looks like confirmation of what I said.

With .ascii:

$ cat test_space.s 
.ascii "foo" "bar"
$ cat test_comma.s 
.ascii "foo", "bar"
$ cat test_concatenation.s 
.ascii "foobar"

$ gcc test_space.s -c -o test_space.o
$ objdump -dr test_space.o

You can use llvm-readelf --string-dump=.text test_space.o to interpret the .text section as C style strings, though that might not be helpful for the .ascii examples.

In D91460#2395522, @jrtc27 wrote:
In D91460#2395521, @jcai19 wrote:
In which case I assume .asciz "foo" "bar" is equivalent to .asciz "foobar" rather than .asciz "foo\0bar" (same for .string), just like C string concatenation, and hence why both syntaxes exist?

I think GNU assembler would produce the same disassembly with .asciz "foo" "bar" and .asciz "foo", "bar".
$ cat test.s 
.asciz "foo" "bar"
.asciz "foo", "bar"

$ arm-linux-gnueabihf-gcc test.s -c -o test.o
$ arm-linux-gnueabihf-objdump -dr test.o
gcc.o:     file format elf32-littlearm


Disassembly of section .text:

00000000 <.text>:
   0:	006f6f66 	.word	0x006f6f66
   4:	00726162 	.word	0x00726162
   8:	006f6f66 	.word	0x006f6f66
   c:	00726162 	.word	0x00726162
In which case that's confusing and likely to lead to bugs if people make use of the preprocessor in the hopes of concatenating strings but end up with NUL bytes being inserted contrary to what there would be in C. Can we not just fix the assembly to use commas rather than this weird syntax that seems to be a special case for .asciz? You can't write .word 2 2, only .word 2, 2, so why do string directives really need special treatment?

I disagree. That's why .ascii is distinct from .asciz; if developers do not want NUL-terminated C style strings, they should use .ascii and not .asciz. The assembler need not match the behavior of the C preprocessor; this patch is about matching the behavior of GNU as such that clang can be used as a substitute. Not matching the behavior of GNU as precisely here would be a mistake that would hinder the adoption of clang for existing assembler sources.

@jcai19 this patch still needs tests for this new change of behavior to the parser. You have good test cases in your comments; those should be added into this patch itself.

IMO matching C preprocessor behavior isn't enough of a compelling reason to diverge from gas's behavior here. Nick asked for more tests, and we could stand to test each directive with the new behavior. Code looks good to me.

In D91460#2395522, @jrtc27 wrote:

In which case that's confusing and likely to lead to bugs if people make use of the preprocessor in the hopes of concatenating strings but end up with NUL bytes being inserted contrary to what there would be in C. Can we not just fix the assembly to use commas rather than this weird syntax that seems to be a special case for .asciz? You can't write .word 2 2, only .word 2, 2, so why do string directives really need special treatment?

I disagree. That's why .ascii is distinct from .asciz; if developers do not want NUL-terminated C style strings, they should use .ascii and not .asciz. The assembler need not match the behavior of the C preprocessor; this patch is about matching the behavior of GNU as such that clang can be used as a substitute. Not matching the behavior of GNU as precisely here would be a mistake that would hinder the adoption of clang for existing assembler sources.

Well, no, it is confusing for someone who doesn't know about the GNU syntax. If I see .asciz "foo" "bar" I'd assume it's equivalent to .asciz "foobar" not .asciz "foo"; .asciz "bar" and that .asciz is being used so there's a NUL on the end of the concatenated string.

FreeBSD has .asciz MACHINE_ARCH in one of its arm csu files (for an ELF note). Some architectures like to build up MACHINE_ARCH (defined in a header file) by concatenating multiple strings together for the different components. Currently arm doesn't do this (and only has two variants), but you can imagine other architectures might also want an ELF note and so would give confusing behaviour (you may not find it confusing, but I can guarantee you a large fraction of people would); if you want the C-like behaviour you have to use .ascii and add the NUL byte manually, as is done in your example.

But that's all justification for why I don't like GNU as's behaviour and why I don't like adding that feature to LLVM (I would not advocate for _different_ behaviour as that would cause even more problems, only to just not implement it at all). However, your example is a real-world case of something that really does need this feature and cannot easily be rewritten to use commas, so I see this as unavoidable and so yes, LLVM should implement this GNU as feature.

In D91460#2398228, @jrtc27 wrote:

In D91460#2395522, @jrtc27 wrote:

In which case that's confusing and likely to lead to bugs if people make use of the preprocessor in the hopes of concatenating strings but end up with NUL bytes being inserted contrary to what there would be in C. Can we not just fix the assembly to use commas rather than this weird syntax that seems to be a special case for .asciz? You can't write .word 2 2, only .word 2, 2, so why do string directives really need special treatment?

I disagree. That's why .ascii is distinct from .asciz; if developers do not want NUL-terminated C style strings, they should use .ascii and not .asciz. The assembler need not match the behavior of the C preprocessor; this patch is about matching the behavior of GNU as such that clang can be used as a substitute. Not matching the behavior of GNU as precisely here would be a mistake that would hinder the adoption of clang for existing assembler sources.

Well, no, it is confusing for someone who doesn't know about the GNU syntax. If I see .asciz "foo" "bar" I'd assume it's equivalent to .asciz "foobar" not .asciz "foo"; .asciz "bar" and that .asciz is being used so there's a NUL on the end of the concatenated string.

FreeBSD has .asciz MACHINE_ARCH in one of its arm csu files (for an ELF note). Some architectures like to build up MACHINE_ARCH (defined in a header file) by concatenating multiple strings together for the different components. Currently arm doesn't do this (and only has two variants), but you can imagine other architectures might also want an ELF note and so would give confusing behaviour (you may not find it confusing, but I can guarantee you a large fraction of people would); if you want the C-like behaviour you have to use .ascii and add the NUL byte manually, as is done in your example.

But that's all justification for why I don't like GNU as's behaviour and why I don't like adding that feature to LLVM (I would not advocate for _different_ behaviour as that would cause even more problems, only to just not implement it at all). However, your example is a real-world case of something that really does need this feature and cannot easily be rewritten to use commas, so I see this as unavoidable and so yes, LLVM should implement this GNU as feature.

I agree with @jrtc27 that juxtaposition with .asciz has a bug-prone behavior. This is unrelated to GNU as's preprocessing stage do_scrub_chars. The Linux kernel does not need .asciz, so if we cannot get a reasonable behavior for .asciz, we can simply not allow the usage.

Thanks all for the comments. From what I read so far, we agree thatt there is a real need to implement this feature in LLVM so add more test cases based on the comments.

I mentioned the .asciz issue to binutils.... The reaction was a bit quick .. https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=3d955acb36f483c05724181da5ffba46b1303c43

I think rejecting .asciz "a" "b" may be fine for us since there is no real world usage.

Then I am not sure we need a new method parseManyWithOptionalComma. It can be specialized.

Harbormaster completed remote builds in B79169: Diff 305875.Nov 17 2020, 12:54 PM

I think rejecting .asciz "a" "b" may be fine for us since there is no real world usage.

Then I am not sure we need a new method parseManyWithOptionalComma. It can be specialized.

Does that mean we support space as separators in .ascii but not for .asciz and .string? Wouldn't that create more confusion?

nickdesaulniers added inline comments.Nov 19 2020, 12:25 PM

llvm/test/MC/AsmParser/directive_ascii.s
74	sorry, is there an extra two `00` in the output?
96	sorry, is there an extra two `00` in the output?

bcain added a subscriber: bcain.Nov 19 2020, 12:40 PM

jcai19 added inline comments.Nov 19 2020, 3:50 PM

llvm/test/MC/AsmParser/directive_ascii.s
74	This is what llvm-mc prints. I think both \0 and \000 mean 0 in octal form.

MaskRay added inline comments.Nov 19 2020, 3:54 PM

llvm/test/MC/AsmParser/directive_ascii.s
50–57	The test may be a bit too verbose in this way. One `.ascii "A", "B" "C", "D"` is sufficient for the coverage.
71	`.asciz "A", "B" "C", "D"` should produce `A\0BC\0D\0` as GNU as 2.36 will do

Simplify test cases as @MaskRay suggested.

Harbormaster completed remote builds in B79544: Diff 306562.Nov 19 2020, 5:53 PM

jcai19 retitled this revision from [AsmParser] make .ascii/.asciz/.string support multiple strings to [AsmParser] make .ascii/.asciz/.string support spaces as the separator.Nov 19 2020, 6:07 PM

jcai19 edited the summary of this revision. (Show Details)

jcai19 retitled this revision from [AsmParser] make .ascii/.asciz/.string support spaces as the separator to [AsmParser] make .ascii/.asciz/.string support spaces as separators.

jrtc27 added inline comments.Nov 19 2020, 6:29 PM

llvm/test/MC/AsmParser/directive_ascii.s
56	This no longer matches binutils and is the specific part of this patch that has been contentious.

jcai19 added inline comments.Nov 20 2020, 12:25 PM

llvm/test/MC/AsmParser/directive_ascii.s
56	Do you mean the .asciz test case? I think this is what gnu AS does. $ cat test1.s .asciz "A", "B" "C" $ gcc test1.s -c -o test1.o llvm-readelf --string-dump=.text test1.o String dump of section '.text': [ 0] A [ 2] B [ 4] C $ cat test2.s asciz "A\0B\0C" $ gcc test2.s -c -o test2.o llvm-readelf --string-dump=.text test2.o String dump of section '.text': [ 0] A [ 2] B [ 4] C $ cat test3.s asciz "ABC" $ gcc test3.s -c -o test3.o llvm-readelf --string-dump=.text test3.o String dump of section '.text': [ 0] ABC

jrtc27 added inline comments.Nov 20 2020, 12:26 PM

llvm/test/MC/AsmParser/directive_ascii.s
56	Not any more; see https://reviews.llvm.org/D91460#2400719 above.

jcai19 added inline comments.Nov 20 2020, 12:45 PM

llvm/test/MC/AsmParser/directive_ascii.s
56	Ah thanks I did miss the link. This seems like a big change and would potentially breaking existing code, @MaskRay do you have a link to the binutils bug report?

Since GNU assembler changed how it interpreted using space as separators in .asciz and .string in https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=3d955acb36f483c05724181da5ffba46b1303c43, limit this patch to .ascii only as matching the new GAS behavior in the two directives may cause subtle issues when building existing code that assumes the old behavior with LLVM's integrated assembler.

jcai19 retitled this revision from [AsmParser] make .ascii/.asciz/.string support spaces as separators to [AsmParser] make .ascii support spaces as separators.Nov 30 2020, 2:53 PM

jcai19 edited the summary of this revision. (Show Details)

nickdesaulniers added inline comments.Nov 30 2020, 2:58 PM

llvm/lib/MC/MCParser/AsmParser.cpp
3041–3042	Would a ternary expression make sense here? if (ZeroTerminated ? parseMany(parseOp) : parseManyWithOptionalComma()) return ...;

jrtc27 added inline comments.Nov 30 2020, 3:00 PM

llvm/lib/MC/MCParser/AsmParser.cpp
3005–3006	This grammar needs updating
3024–3025	It'd be nicer to push this all up into parseOp so it just parses all the strings it finds one at a time. For `.ascii` you can get away with treating spaces and commas the same, but if we ever want to support the new `.asciz`/`.string` behaviour (which I feel we should one day, provided binutils doesn't decide to revert its behaviour) we'll need it to be like that, so it'd make it trivial to enable. Also I feel it's a cleaner (and likely simpler) way to write it regardless.

Harbormaster completed remote builds in B80586: Diff 308486.Nov 30 2020, 3:38 PM

Address comments.

Harbormaster completed remote builds in B80591: Diff 308492.Nov 30 2020, 4:39 PM

jcai19 added inline comments.Dec 3 2020, 4:24 PM

llvm/lib/MC/MCParser/AsmParser.cpp
3024–3025	Thanks for the comment. Just to clarify, do you suggest we merge the code of parseManyWithOptionalComma into parseOp? parseOp is used as an argument to parseMany later, so if we change it, then we may have to make adjustments to parseMany too, which is used in other places. Or maybe what you suggest is that we should completely rewrite parseDirectiveAscii function and not call parseMany at all? Also I am not sure about rewriting the code assuming the new GNU behaviors will be the way going forward. I could not find any discussion on the change and its potential implications, and therefore could not confirm if that would be what GNU community agrees on.

jrtc27 added inline comments.Dec 3 2020, 4:40 PM

llvm/lib/MC/MCParser/AsmParser.cpp
3024–3025	I mean: remove parseManyWithOptionalComma keep using parseMany(parseOp) leave parseMany unchanged make parseOp parse _all_ strings up to the first comma, effectively treating them like a single multi-token operand (i.e. keep calling parseEscapedString and emitting the string while there's still another AsmToken::String)

jcai19 added inline comments.Dec 3 2020, 5:32 PM

llvm/lib/MC/MCParser/AsmParser.cpp
3024–3025	Thanks for the clarification. IIUC, we only change parseOp to treat space-separated strings as one string until we hit a comma or end of statement, and keep everything else unchanged. Wouldn't that also affect .asciz and .string and make them behave like the new GNU behavior?

jrtc27 added inline comments.Dec 3 2020, 5:34 PM

llvm/lib/MC/MCParser/AsmParser.cpp
3024–3025	Yes, so for now you can add an additional check to bail out after the first iteration of the do-while loop if ZeroTerminated is true, and at some point in the future all that's needed is to delete that additional check to make .asciz and .string behave like the new GNU binutils.

Address @jrtc27's comment. paseOp function now skips all the spaces between string arguments until it hits any comma or end of statement. For now we will limit the support to .ascii only.

jrtc27 added inline comments.Dec 4 2020, 2:35 PM

llvm/lib/MC/MCParser/AsmParser.cpp
3005–3006	Hm, this isn't particularly clear as it stands that multiple strings next to each other is valid, it just looks like the comma has been put it in brackets. Maybe this to better reflect how it's parsed (square brackets around the comma instead would also work, but doesn't capture the logical parse tree so well)? I don't know if the `"string"` needs to be put in parentheses though to apply `+` to it; is this a real BNF-esque grammar or just a hand-wavey one for human consumption?

jcai19 added inline comments.Dec 4 2020, 3:03 PM

llvm/lib/MC/MCParser/AsmParser.cpp
3005–3006	It appears to be only for human consumption, at least I could not find anywhere it was parsed. For .ascii we could use either space or comma to achieve the same effect, which made comma optional. Based on the original text, I assume brackets mean the enclosed symbol is mandatory while parentheses means it is optional.

jrtc27 added inline comments.Dec 4 2020, 3:09 PM

llvm/lib/MC/MCParser/AsmParser.cpp
3005–3006	No, parens are for grouping (so `*` applies to multiple items), and square brackets indicate the items are optional. https://en.wikipedia.org/wiki/Backus–Naur_form#Variants describes some of these common BNF-based grammars. In this case I think leaving the parens out of my suggestion is fine, nobody in their right mind would think the `+` means that it applies to the double quote (i.e. that, say, `"string""` is valid), and strictly speaking it should really be `'"' string '"'` or similar if you want it to be a proper formal grammar. So unless you have objections I think my suggestion is the right thing to go with :)

jcai19 added inline comments.Dec 4 2020, 3:18 PM

llvm/lib/MC/MCParser/AsmParser.cpp
3005–3006	Okay I double checked the syntax of .ascii and you are right the string arguments are optional. Will update accordingly.

Update the syntax in the comment.

jrtc27 added inline comments.Dec 4 2020, 3:22 PM

llvm/test/MC/AsmParser/directive_ascii.s
71	Hmm, MaskRay suggested a test case with `, "D"` on the end, that's probably a good idea.

Updated a test case.

Harbormaster completed remote builds in B81172: Diff 309658.Dec 4 2020, 4:40 PM

Harbormaster completed remote builds in B81182: Diff 309678.Dec 4 2020, 6:10 PM

Harbormaster completed remote builds in B81189: Diff 309687.Dec 4 2020, 6:52 PM

reames resigned from this revision.Dec 5 2020, 2:36 PM

Just want to follow up and see if anyone might have any thoughts on the current implementation of this patch. Otherwise, is it good to submit the patch? @jrtc27 @rnk @nickdesaulniers @MaskRay

jrtc27 added inline comments.Dec 17 2020, 1:42 PM

llvm/test/MC/AsmParser/directive_ascii.s
52–56	I don't think I understand why these get printed with `.byte` but others with `.ascii`. I think it'd be important to make these use CHECK-NEXT (and CHECK-LABEL) though to assert that there aren't 0s sneaking in (which this test is poor at, the existing cases are not well-written).

Update the test case.

jcai19 added inline comments.Dec 17 2020, 2:41 PM

llvm/test/MC/AsmParser/directive_ascii.s
52–56	< I don't think I understand why these get printed with .byte but others with .ascii It seems llvm-mc prints ascii code for single bytes, as shown in other test cases. Or perhaps your question is about the implementation of llvm-mc? < I think it'd be important to make these use CHECK-NEXT (and CHECK-LABEL) though to assert that there aren't 0s sneaking in (which this test is poor at, the existing cases are not well-written). Updated. Thanks.

jrtc27 added inline comments.Dec 17 2020, 2:50 PM

llvm/test/MC/AsmParser/directive_ascii.s
52–56	Ok that makes sense. In an ideal world this would be `.byte 65 ; .ascii "BC" ; .byte 68` then but that's purely cosmetic and not worth wasting time implementing (you'd have to concatenate to Data in the new parse loop and then emit it all at once at the end of the loop).

Looks good to me, let's see what the others think.

This revision is now accepted and ready to land.Dec 17 2020, 2:50 PM

In D91460#2461621, @jrtc27 wrote:

Looks good to me, let's see what the others think.

Thanks. Will leave it open for another 24 hours for comments.

Harbormaster completed remote builds in B82865: Diff 312615.Dec 17 2020, 3:16 PM

Thanks Jian! (I can get a little further assembling the Linux kernel's arch/arm/probes/kprobes/test-arm.o (allyesconfig) with Clang's integrated assembler with this; though that file now has additional failures unrelated to this patch.

This revision was landed with ongoing or failed builds.Dec 20 2020, 10:45 PM

Closed by commit rGe0963ae274be: [AsmParser] make .ascii support spaces as separators (authored by jcai19). · Explain Why

This revision was automatically updated to reflect the committed changes.

jcai19 added a commit: rGe0963ae274be: [AsmParser] make .ascii support spaces as separators.

jcai19 added a reverting change: rG82c4153e66fa: Revert "[AsmParser] make .ascii support spaces as separators".Jan 13 2021, 2:38 PM

Which GDB tests? How are they broken? I'm extremely surprised by the given this patch implements what binutils has done for a very long time (only .asciz/.string have changed behaviour in binutils, and those are not modified by this patch here so should not have regressions).

This seems like a premature revert and probably just causes churn to the LLVM upstream to me...The last revision looks good to me and resolves a GNU as/LLVM integrated assembler difference which we agreed the integrated assembler should improve in that case.

Upstream gdb developers don't build Clang. Many tests may be written in a GCC/binutils dependent way. The issue is not necessarily the integrated assembler issue. It can well be in another place we haven't noticed yet. If indeed that way, internally we can just disable the gdb tests. If there is a Clang 11 vs Clang 12 difference. We could make the upstream gdb tests more amenable to more compiler versions.

MaskRay reopened this revision.Jan 13 2021, 2:56 PM

This revision is now accepted and ready to land.Jan 13 2021, 2:56 PM

In D91460#2496849, @jrtc27 wrote:

Which GDB tests? How are they broken? I'm extremely surprised by the given this patch implements what binutils has done for a very long time (only .asciz/.string have changed behaviour in binutils, and those are not modified by this patch here so should not have regressions).

The reason is unclear yet and may as well not be caused by this change as @MaskRay pointed out. Will reland or update the patch once we know the reason.

MaskRay added inline comments.Jan 13 2021, 9:53 PM

llvm/lib/MC/MCParser/AsmParser.cpp
3013	A complete sentence needs a period.

The test failure turned out to be a false positive. Relanded the change on https://github.com/llvm/llvm-project/commit/9dfeec853008109b1cbe926c22675c96226040d9

Revision Contents

Path

Size

llvm/

lib/

MC/

MCParser/

AsmParser.cpp

13 lines

test/

MC/

AsmParser/

directive_ascii.s

7 lines

Diff 309678

llvm/lib/MC/MCParser/AsmParser.cpp

Show First 20 Lines • Show All 2,996 Lines • ▼ Show 20 Lines if (isAngleBracketString(StartLoc, EndLoc)) {

Data = angleBracketString(StringRef(StartChar, EndChar - StartChar)); Data = angleBracketString(StringRef(StartChar, EndChar - StartChar));

return false; return false;

} }

return true; return true;

} }

/// parseDirectiveAscii: /// parseDirectiveAscii:

/// ::= ( .ascii | .asciz | .string ) [ "string" ( , "string" )* ] // ::= .ascii [ "string"+ ( , "string"+ )* ]

/// ::= ( .asciz | .string ) [ "string" ( , "string" )* ]

jrtc27Unsubmitted

Not Done

This grammar needs updating

jrtc27: This grammar needs updating

jrtc27Unsubmitted

Not Done

/// parseDirectiveAscii:

- /// ::= .ascii [ "string" ( ( , ) "string" )* ]

+ /// ::= .ascii [ "string"+ ( , "string"+ )* ]

/// ::= ( .asciz | .string ) [ "string" ( , "string" )* ]

Hm, this isn't particularly clear as it stands that multiple strings next to each other is valid, it just looks like the comma has been put it in brackets. Maybe this to better reflect how it's parsed (square brackets around the comma instead would also work, but doesn't capture the logical parse tree so well)? I don't know if the "string" needs to be put in parentheses though to apply + to it; is this a real BNF-esque grammar or just a hand-wavey one for human consumption?

jrtc27: Hm, this isn't particularly clear as it stands that multiple strings next to each other is…

jcai19AuthorUnsubmitted

Done

It appears to be only for human consumption, at least I could not find anywhere it was parsed. For .ascii we could use either space or comma to achieve the same effect, which made comma optional. Based on the original text, I assume brackets mean the enclosed symbol is mandatory while parentheses means it is optional.

jcai19: It appears to be only for human consumption, at least I could not find anywhere it was parsed.

jrtc27Unsubmitted

Not Done

No, parens are for grouping (so * applies to multiple items), and square brackets indicate the items are optional. https://en.wikipedia.org/wiki/Backus–Naur_form#Variants describes some of these common BNF-based grammars. In this case I think leaving the parens out of my suggestion is fine, nobody in their right mind would think the + means that it applies to the double quote (i.e. that, say, "string"" is valid), and strictly speaking it should really be '"' string '"' or similar if you want it to be a proper formal grammar. So unless you have objections I think my suggestion is the right thing to go with :)

jrtc27: No, parens are for grouping (so `*` applies to multiple items), and square brackets indicate…

jcai19AuthorUnsubmitted

Done

Okay I double checked the syntax of .ascii and you are right the string arguments are optional. Will update accordingly.

jcai19: Okay I double checked the syntax of .ascii and you are right the string arguments are optional.

bool AsmParser::parseDirectiveAscii(StringRef IDVal, bool ZeroTerminated) { bool AsmParser::parseDirectiveAscii(StringRef IDVal, bool ZeroTerminated) {

auto parseOp = [&]() -> bool { auto parseOp = [&]() -> bool {

std::string Data; std::string Data;

if (checkForValidSection() || parseEscapedString(Data)) if (checkForValidSection())

return true;

// Only support spaces as separators for .ascii directive for now. See the

// discusssion at https://reviews.llvm.org/D91460 for more details

MaskRayUnsubmitted

Done

A complete sentence needs a period.

MaskRay: A complete sentence needs a period.

do {

if (parseEscapedString(Data))

return true; return true;

getStreamer().emitBytes(Data); getStreamer().emitBytes(Data);

} while (!ZeroTerminated && getTok().is(AsmToken::String));

if (ZeroTerminated) if (ZeroTerminated)

getStreamer().emitBytes(StringRef("\0", 1)); getStreamer().emitBytes(StringRef("\0", 1));

return false; return false;

}; };

if (parseMany(parseOp)) if (parseMany(parseOp))

return addErrorSuffix(" in '" + Twine(IDVal) + "' directive"); return addErrorSuffix(" in '" + Twine(IDVal) + "' directive");

jrtc27Unsubmitted

Not Done

It'd be nicer to push this all up into parseOp so it just parses all the strings it finds one at a time. For .ascii you can get away with treating spaces and commas the same, but if we ever want to support the new .asciz/.string behaviour (which I feel we should one day, provided binutils doesn't decide to revert its behaviour) we'll need it to be like that, so it'd make it trivial to enable. Also I feel it's a cleaner (and likely simpler) way to write it regardless.

jrtc27: It'd be nicer to push this all up into parseOp so it just parses all the strings it finds one…

jcai19AuthorUnsubmitted

Done

Thanks for the comment. Just to clarify, do you suggest we merge the code of parseManyWithOptionalComma into parseOp? parseOp is used as an argument to parseMany later, so if we change it, then we may have to make adjustments to parseMany too, which is used in other places. Or maybe what you suggest is that we should completely rewrite parseDirectiveAscii function and not call parseMany at all? Also I am not sure about rewriting the code assuming the new GNU behaviors will be the way going forward. I could not find any discussion on the change and its potential implications, and therefore could not confirm if that would be what GNU community agrees on.

jcai19: Thanks for the comment. Just to clarify, do you suggest we merge the code of…

jrtc27Unsubmitted

Not Done

I mean:

remove parseManyWithOptionalComma
keep using parseMany(parseOp)
leave parseMany unchanged
make parseOp parse _all_ strings up to the first comma, effectively treating them like a single multi-token operand (i.e. keep calling parseEscapedString and emitting the string while there's still another AsmToken::String)

jrtc27: I mean: - remove parseManyWithOptionalComma - keep using parseMany(parseOp) - leave parseMany…

jcai19AuthorUnsubmitted

Done

Thanks for the clarification. IIUC, we only change parseOp to treat space-separated strings as one string until we hit a comma or end of statement, and keep everything else unchanged. Wouldn't that also affect .asciz and .string and make them behave like the new GNU behavior?

jcai19: Thanks for the clarification. IIUC, we only change parseOp to treat space-separated strings as…

jrtc27Unsubmitted

Not Done

Yes, so for now you can add an additional check to bail out after the first iteration of the do-while loop if ZeroTerminated is true, and at some point in the future all that's needed is to delete that additional check to make .asciz and .string behave like the new GNU binutils.

jrtc27: Yes, so for now you can add an additional check to bail out after the first iteration of the do…

return false; return false;

jrtc27Unsubmitted

Not Done

return false;

};

- while (true) {

- if (parseOptionalToken(AsmToken::EndOfStatement)) // return true on success

- break;

+ while (!parseOptionalToken(AsmToken::EndOfStatement)) {

if (parseOp()) // return false on success

jrtc27:

} }

jrtc27Unsubmitted

Not Done

Should be "returns" not "return" otherwise it means something very different, though I'm not sure this is a particularly useful comment. You could also consider inverting the return value of the parseOp lambda above to be more intuitive.

jrtc27: Should be "returns" not "return" otherwise it means something very different, though I'm not…

/// parseDirectiveReloc /// parseDirectiveReloc

/// ::= .reloc expression , identifier [ , expression ] /// ::= .reloc expression , identifier [ , expression ]

bool AsmParser::parseDirectiveReloc(SMLoc DirectiveLoc) { bool AsmParser::parseDirectiveReloc(SMLoc DirectiveLoc) {

const MCExpr *Offset; const MCExpr *Offset;

const MCExpr *Expr = nullptr; const MCExpr *Expr = nullptr;

SMLoc OffsetLoc = Lexer.getTok().getLoc(); SMLoc OffsetLoc = Lexer.getTok().getLoc();

if (parseExpression(Offset)) if (parseExpression(Offset))

return true; return true;

if (parseToken(AsmToken::Comma, "expected comma") || if (parseToken(AsmToken::Comma, "expected comma") ||

check(getTok().isNot(AsmToken::Identifier), "expected relocation name")) check(getTok().isNot(AsmToken::Identifier), "expected relocation name"))

return true; return true;

SMLoc NameLoc = Lexer.getTok().getLoc(); SMLoc NameLoc = Lexer.getTok().getLoc();

nickdesaulniersUnsubmitted

Not Done

Would a ternary expression make sense here?

if (ZeroTerminated ? parseMany(parseOp) : parseManyWithOptionalComma())
  return ...;

nickdesaulniers: Would a ternary expression make sense here? ``` if (ZeroTerminated ? parseMany(parseOp)…

StringRef Name = Lexer.getTok().getIdentifier(); StringRef Name = Lexer.getTok().getIdentifier();

Lex(); Lex();

if (Lexer.is(AsmToken::Comma)) { if (Lexer.is(AsmToken::Comma)) {

Lex(); Lex();

SMLoc ExprLoc = Lexer.getLoc(); SMLoc ExprLoc = Lexer.getLoc();

if (parseExpression(Expr)) if (parseExpression(Expr))

return true; return true;

▲ Show 20 Lines • Show All 3,105 Lines • Show Last 20 Lines

llvm/test/MC/AsmParser/directive_ascii.s

	Show First 20 Lines • Show All 41 Lines • ▼ Show 20 Lines

	# CHECK: TEST7:			# CHECK: TEST7:
	# CHECK: .ascii "dk"			# CHECK: .ascii "dk"
	# 0xFACE & 0xFF == 0xCE == 0o316			# 0xFACE & 0xFF == 0xCE == 0o316
	# 0x0FE & 0xFF == 0xFE == 0o376			# 0x0FE & 0xFF == 0xFE == 0o376
	# CHECK: .ascii "\316\376"			# CHECK: .ascii "\316\376"
	TEST7:			TEST7:
	.ascii "\x64\Xa6B"			.ascii "\x64\Xa6B"
	.ascii "\xface\x0Fe"			.ascii "\xface\x0Fe"

				# CHECK: TEST8:
				# CHECK: .byte 65
				# CHECK: .byte 66
				# CHECK: .byte 67
				TEST8:
				jrtc27Unsubmitted Not Done Reply Inline Actions This no longer matches binutils and is the specific part of this patch that has been contentious. jrtc27: This no longer matches binutils and is the specific part of this patch that has been…
				jcai19AuthorUnsubmitted Done Reply Inline Actions Do you mean the .asciz test case? I think this is what gnu AS does. $ cat test1.s .asciz "A", "B" "C" $ gcc test1.s -c -o test1.o llvm-readelf --string-dump=.text test1.o String dump of section '.text': [ 0] A [ 2] B [ 4] C $ cat test2.s asciz "A\0B\0C" $ gcc test2.s -c -o test2.o llvm-readelf --string-dump=.text test2.o String dump of section '.text': [ 0] A [ 2] B [ 4] C $ cat test3.s asciz "ABC" $ gcc test3.s -c -o test3.o llvm-readelf --string-dump=.text test3.o String dump of section '.text': [ 0] ABC jcai19: Do you mean the .asciz test case? I think this is what gnu AS does. $ cat test1.s .asciz "A"…
				jrtc27Unsubmitted Not Done Reply Inline Actions Not any more; see https://reviews.llvm.org/D91460#2400719 above. jrtc27: Not any more; see https://reviews.llvm.org/D91460#2400719 above.
				jcai19AuthorUnsubmitted Done Reply Inline Actions Ah thanks I did miss the link. This seems like a big change and would potentially breaking existing code, @MaskRay do you have a link to the binutils bug report? jcai19: Ah thanks I did miss the link. This seems like a big change and would potentially breaking…
				jrtc27Unsubmitted Not Done Reply Inline Actions I don't think I understand why these get printed with `.byte` but others with `.ascii`. I think it'd be important to make these use CHECK-NEXT (and CHECK-LABEL) though to assert that there aren't 0s sneaking in (which this test is poor at, the existing cases are not well-written). jrtc27: I don't think I understand why these get printed with `.byte` but others with `.ascii`. I think…
				jcai19AuthorUnsubmitted Done Reply Inline Actions < I don't think I understand why these get printed with .byte but others with .ascii It seems llvm-mc prints ascii code for single bytes, as shown in other test cases. Or perhaps your question is about the implementation of llvm-mc? < I think it'd be important to make these use CHECK-NEXT (and CHECK-LABEL) though to assert that there aren't 0s sneaking in (which this test is poor at, the existing cases are not well-written). Updated. Thanks. jcai19: < I don't think I understand why these get printed with .byte but others with .ascii It seems…
				jrtc27Unsubmitted Not Done Reply Inline Actions Ok that makes sense. In an ideal world this would be `.byte 65 ; .ascii "BC" ; .byte 68` then but that's purely cosmetic and not worth wasting time implementing (you'd have to concatenate to Data in the new parse loop and then emit it all at once at the end of the loop). jrtc27: Ok that makes sense. In an ideal world this would be `.byte 65 ; .ascii "BC" ; .byte 68` then…
				.ascii "A", "B" "C"
				nickdesaulniersUnsubmitted Not Done Reply Inline Actions sorry, is there an extra two `00` in the output? nickdesaulniers: sorry, is there an extra two `00` in the output?
				jcai19AuthorUnsubmitted Done Reply Inline Actions This is what llvm-mc prints. I think both \0 and \000 mean 0 in octal form. jcai19: This is what llvm-mc prints. I think both \0 and \000 mean 0 in octal form.
				nickdesaulniersUnsubmitted Not Done Reply Inline Actions sorry, is there an extra two `00` in the output? nickdesaulniers: sorry, is there an extra two `00` in the output?
				MaskRayUnsubmitted Not Done Reply Inline Actions The test may be a bit too verbose in this way. One `.ascii "A", "B" "C", "D"` is sufficient for the coverage. MaskRay: The test may be a bit too verbose in this way. One `.ascii "A", "B" "C", "D"` is sufficient…
				MaskRayUnsubmitted Not Done Reply Inline Actions `.asciz "A", "B" "C", "D"` should produce `A\0BC\0D\0` as GNU as 2.36 will do MaskRay: `.asciz "A", "B" "C", "D"` should produce `A\0BC\0D\0` as GNU as 2.36 will do
				jrtc27Unsubmitted Not Done Reply Inline Actions Hmm, MaskRay suggested a test case with `, "D"` on the end, that's probably a good idea. jrtc27: Hmm, MaskRay suggested a test case with `, "D"` on the end, that's probably a good idea.

This is an archive of the discontinued LLVM Phabricator instance.

[AsmParser] make .ascii support spaces as separatorsClosedPublic

Details

Diff Detail

Unit TestsFailed

Event Timeline

Revision Contents

Diff 309678

llvm/lib/MC/MCParser/AsmParser.cpp

llvm/test/MC/AsmParser/directive_ascii.s

[AsmParser] make .ascii support spaces as separators
ClosedPublic