Currently the integrated assembler only allows commas as the separator
between string arguments in .ascii. This patch adds support to using
space as separators and make IAS consistent with GNU assembler.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
This needs tests, both to show it accepts multiple strings and to show for each of the three directives that it produces the right output (i.e. is equivalent to multiple individual directives).
llvm/lib/MC/MCParser/AsmParser.cpp | ||
---|---|---|
3031–3032 | ||
3034 | Should be "returns" not "return" otherwise it means something very different, though I'm not sure this is a particularly useful comment. You could also consider inverting the return value of the parseOp lambda above to be more intuitive. |
Thanks for the reminder. I did some local testing but forgot to add tests and was about to leave a comment here that I would add them soon.
It appears that I misunderstood the existing code. It actually supports multiple string arguments, but only when commas are used between the arguments. GAS supports commas or spaces as the separator (or when both are used in the same call such as .ascii "aa", "bb" "cc ). Will refactor my patch and update the commit message.
In which case I assume .asciz "foo" "bar" is equivalent to .asciz "foobar" rather than .asciz "foo\0bar" (same for .string), just like C string concatenation, and hence why both syntaxes exist?
In which case I assume .asciz "foo" "bar" is equivalent to .asciz "foobar" rather than .asciz "foo\0bar" (same for .string), just like C string concatenation, and hence why both syntaxes exist?
I think GNU assembler would produce the same disassembly with .asciz "foo" "bar" and .asciz "foo", "bar".
$ cat test.s .asciz "foo" "bar" .asciz "foo", "bar" $ arm-linux-gnueabihf-gcc test.s -c -o test.o $ arm-linux-gnueabihf-objdump -dr test.o gcc.o: file format elf32-littlearm Disassembly of section .text: 00000000 <.text>: 0: 006f6f66 .word 0x006f6f66 4: 00726162 .word 0x00726162 8: 006f6f66 .word 0x006f6f66 c: 00726162 .word 0x00726162
In which case that's confusing and likely to lead to bugs if people make use of the preprocessor in the hopes of concatenating strings but end up with NUL bytes being inserted contrary to what there would be in C. Can we not just fix the assembly to use commas rather than this weird syntax that seems to be a special case for .asciz? You can't write .word 2 2, only .word 2, 2, so why do string directives really need special treatment?
In which case that's confusing and likely to lead to bugs if people make use of the preprocessor in the hopes of concatenating strings but end up with NUL bytes being inserted contrary to what there would be in C. Can we not just fix the assembly to use commas rather than this weird syntax that seems to be a special case for .asciz? You can't write .word 2 2, only .word 2, 2, so why do string directives really need special treatment?
I agree only accepting comma as the separators would be clearer in newer code. On the other hand, adding the compatibility with GAS would allow IAS to support more legacy code base (although I'd admit I'm not sure how widely-used separating strings with space in these directives is). @nickdesaulniers Any thoughts and comments?
https://github.com/ClangBuiltLinux/linux/blob/f01c30de86f1047e9bae1b1b1417b0ce8dcd15b1/arch/arm/probes/kprobes/test-core.h#L114-L116 makes it sounds like .ascii and .asciz work differently in this regard?
https://github.com/ClangBuiltLinux/linux/blob/f01c30de86f1047e9bae1b1b1417b0ce8dcd15b1/arch/arm/probes/kprobes/test-core.h#L114-L116 makes it sounds like .ascii and .asciz work differently in this regard?
So I played with the two directives a little bit, and I think they both treat a call with multiple string arguments the same as multiple calls with one argument. For example, .ascii produces the same disassembly with "foo" "bar", "foo", "bar" or "foobar" sine the directive does not append anything after string, while .asciz appends \0 and turns "foo" "bar" or "foo", "bar" into "foo\0bar" .
With .ascii:
$ cat test_space.s .ascii "foo" "bar" $ cat test_comma.s .ascii "foo", "bar" $ cat test_concatenation.s .ascii "foobar" $ gcc test_space.s -c -o test_space.o $ objdump -dr test_space.o test_space.o: file format elf64-x86-64 Disassembly of section .text: 0000000000000000 <.text>: 0: 66 6f outsw %ds:(%rsi),(%dx) 2: 6f outsl %ds:(%rsi),(%dx) 3: 62 .byte 0x62 4: 61 (bad) 5: 72 .byte 0x72 $ gcc test_comma.s -c -o test_comma.o $ objdump -dr test_comma.o test_comma.o: file format elf64-x86-64 Disassembly of section .text: 0000000000000000 <.text>: 0: 66 6f outsw %ds:(%rsi),(%dx) 2: 6f outsl %ds:(%rsi),(%dx) 3: 62 .byte 0x62 4: 61 (bad) 5: 72 .byte 0x72 $ gcc test_concatenation.s -c -o test_concatenation.o $ objdump -dr test_concatenation.o test_concatenation.o: file format elf64-x86-64 Disassembly of section .text: 0000000000000000 <.text>: 0: 66 6f outsw %ds:(%rsi),(%dx) 2: 6f outsl %ds:(%rsi),(%dx) 3: 62 .byte 0x62 4: 61 (bad) 5: 72 .byte 0x72 4: Address 0x0000000000000004 is out of bounds.
With .asciz:
$ cat test_space.s .asciz "foo" "bar" $ cat test_comma.s .asciz "foo", "bar" $ cat test_concatenation.s .asciz "foo\0bar" $ gcc test_space.s -c -o test_space.o $ objdump -dr test_space.o test_space.o: file format elf64-x86-64 Disassembly of section .text: 0000000000000000 <.text>: 0: 66 6f outsw %ds:(%rsi),(%dx) 2: 6f outsl %ds:(%rsi),(%dx) 3: 00 62 61 add %ah,0x61(%rdx) 6: 72 00 jb 0x8 $ gcc test_comma.s -c -o test_comma.o $ objdump -dr test_comma.o test_comma.o: file format elf64-x86-64 Disassembly of section .text: 0000000000000000 <.text>: 0: 66 6f outsw %ds:(%rsi),(%dx) 2: 6f outsl %ds:(%rsi),(%dx) 3: 00 62 61 add %ah,0x61(%rdx) 6: 72 00 jb 0x8 $ gcc test_concatenation.s -c -o test_concatenation.o $ objdump -dr test_concatenation.o Disassembly of section .text: 0000000000000000 <.text>: 0: 66 6f outsw %ds:(%rsi),(%dx) 2: 6f outsl %ds:(%rsi),(%dx) 3: 00 62 61 add %ah,0x61(%rdx) 6: 72 00 jb 0x8
This looks like confirmation of what I said.
With .ascii:
$ cat test_space.s .ascii "foo" "bar" $ cat test_comma.s .ascii "foo", "bar" $ cat test_concatenation.s .ascii "foobar" $ gcc test_space.s -c -o test_space.o $ objdump -dr test_space.o
You can use llvm-readelf --string-dump=.text test_space.o to interpret the .text section as C style strings, though that might not be helpful for the .ascii examples.
I disagree. That's why .ascii is distinct from .asciz; if developers do not want NUL-terminated C style strings, they should use .ascii and not .asciz. The assembler need not match the behavior of the C preprocessor; this patch is about matching the behavior of GNU as such that clang can be used as a substitute. Not matching the behavior of GNU as precisely here would be a mistake that would hinder the adoption of clang for existing assembler sources.
@jcai19 this patch still needs tests for this new change of behavior to the parser. You have good test cases in your comments; those should be added into this patch itself.
IMO matching C preprocessor behavior isn't enough of a compelling reason to diverge from gas's behavior here. Nick asked for more tests, and we could stand to test each directive with the new behavior. Code looks good to me.
I disagree. That's why .ascii is distinct from .asciz; if developers do not want NUL-terminated C style strings, they should use .ascii and not .asciz. The assembler need not match the behavior of the C preprocessor; this patch is about matching the behavior of GNU as such that clang can be used as a substitute. Not matching the behavior of GNU as precisely here would be a mistake that would hinder the adoption of clang for existing assembler sources.
Well, no, it is confusing for someone who doesn't know about the GNU syntax. If I see .asciz "foo" "bar" I'd assume it's equivalent to .asciz "foobar" not .asciz "foo"; .asciz "bar" and that .asciz is being used so there's a NUL on the end of the concatenated string.
FreeBSD has .asciz MACHINE_ARCH in one of its arm csu files (for an ELF note). Some architectures like to build up MACHINE_ARCH (defined in a header file) by concatenating multiple strings together for the different components. Currently arm doesn't do this (and only has two variants), but you can imagine other architectures might also want an ELF note and so would give confusing behaviour (you may not find it confusing, but I can guarantee you a large fraction of people would); if you want the C-like behaviour you have to use .ascii and add the NUL byte manually, as is done in your example.
But that's all justification for why I don't like GNU as's behaviour and why I don't like adding that feature to LLVM (I would not advocate for _different_ behaviour as that would cause even more problems, only to just not implement it at all). However, your example is a real-world case of something that really does need this feature and cannot easily be rewritten to use commas, so I see this as unavoidable and so yes, LLVM should implement this GNU as feature.
I agree with @jrtc27 that juxtaposition with .asciz has a bug-prone behavior. This is unrelated to GNU as's preprocessing stage do_scrub_chars. The Linux kernel does not need .asciz, so if we cannot get a reasonable behavior for .asciz, we can simply not allow the usage.
Thanks all for the comments. From what I read so far, we agree thatt there is a real need to implement this feature in LLVM so add more test cases based on the comments.
I mentioned the .asciz issue to binutils.... The reaction was a bit quick .. https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=3d955acb36f483c05724181da5ffba46b1303c43
I think rejecting .asciz "a" "b" may be fine for us since there is no real world usage.
Then I am not sure we need a new method parseManyWithOptionalComma. It can be specialized.
I think rejecting .asciz "a" "b" may be fine for us since there is no real world usage.
Then I am not sure we need a new method parseManyWithOptionalComma. It can be specialized.
Does that mean we support space as separators in .ascii but not for .asciz and .string? Wouldn't that create more confusion?
llvm/test/MC/AsmParser/directive_ascii.s | ||
---|---|---|
74 | This is what llvm-mc prints. I think both \0 and \000 mean 0 in octal form. |
llvm/test/MC/AsmParser/directive_ascii.s | ||
---|---|---|
56 | This no longer matches binutils and is the specific part of this patch that has been contentious. |
llvm/test/MC/AsmParser/directive_ascii.s | ||
---|---|---|
56 | Do you mean the .asciz test case? I think this is what gnu AS does. $ cat test1.s $ gcc test1.s -c -o test1.o $ cat test2.s $ gcc test2.s -c -o test2.o $ cat test3.s $ gcc test3.s -c -o test3.o |
llvm/test/MC/AsmParser/directive_ascii.s | ||
---|---|---|
56 | Not any more; see https://reviews.llvm.org/D91460#2400719 above. |
Since GNU assembler changed how it interpreted using space as separators in .asciz and .string in https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=3d955acb36f483c05724181da5ffba46b1303c43, limit this patch to .ascii only as matching the new GAS behavior in the two directives may cause subtle issues when building existing code that assumes the old behavior with LLVM's integrated assembler.
llvm/lib/MC/MCParser/AsmParser.cpp | ||
---|---|---|
3047–3048 | Would a ternary expression make sense here? if (ZeroTerminated ? parseMany(parseOp) : parseManyWithOptionalComma()) return ...; |
llvm/lib/MC/MCParser/AsmParser.cpp | ||
---|---|---|
3011–3012 | This grammar needs updating | |
3030–3031 | It'd be nicer to push this all up into parseOp so it just parses all the strings it finds one at a time. For .ascii you can get away with treating spaces and commas the same, but if we ever want to support the new .asciz/.string behaviour (which I feel we should one day, provided binutils doesn't decide to revert its behaviour) we'll need it to be like that, so it'd make it trivial to enable. Also I feel it's a cleaner (and likely simpler) way to write it regardless. |
llvm/lib/MC/MCParser/AsmParser.cpp | ||
---|---|---|
3030–3031 | Thanks for the comment. Just to clarify, do you suggest we merge the code of parseManyWithOptionalComma into parseOp? parseOp is used as an argument to parseMany later, so if we change it, then we may have to make adjustments to parseMany too, which is used in other places. Or maybe what you suggest is that we should completely rewrite parseDirectiveAscii function and not call parseMany at all? Also I am not sure about rewriting the code assuming the new GNU behaviors will be the way going forward. I could not find any discussion on the change and its potential implications, and therefore could not confirm if that would be what GNU community agrees on. |
llvm/lib/MC/MCParser/AsmParser.cpp | ||
---|---|---|
3030–3031 | I mean:
|
llvm/lib/MC/MCParser/AsmParser.cpp | ||
---|---|---|
3030–3031 | Thanks for the clarification. IIUC, we only change parseOp to treat space-separated strings as one string until we hit a comma or end of statement, and keep everything else unchanged. Wouldn't that also affect .asciz and .string and make them behave like the new GNU behavior? |
llvm/lib/MC/MCParser/AsmParser.cpp | ||
---|---|---|
3030–3031 | Yes, so for now you can add an additional check to bail out after the first iteration of the do-while loop if ZeroTerminated is true, and at some point in the future all that's needed is to delete that additional check to make .asciz and .string behave like the new GNU binutils. |
Address @jrtc27's comment. paseOp function now skips all the spaces between string arguments until it hits any comma or end of statement. For now we will limit the support to .ascii only.
llvm/lib/MC/MCParser/AsmParser.cpp | ||
---|---|---|
3011–3012 | Hm, this isn't particularly clear as it stands that multiple strings next to each other is valid, it just looks like the comma has been put it in brackets. Maybe this to better reflect how it's parsed (square brackets around the comma instead would also work, but doesn't capture the logical parse tree so well)? I don't know if the "string" needs to be put in parentheses though to apply + to it; is this a real BNF-esque grammar or just a hand-wavey one for human consumption? |
llvm/lib/MC/MCParser/AsmParser.cpp | ||
---|---|---|
3011–3012 | It appears to be only for human consumption, at least I could not find anywhere it was parsed. For .ascii we could use either space or comma to achieve the same effect, which made comma optional. Based on the original text, I assume brackets mean the enclosed symbol is mandatory while parentheses means it is optional. |
llvm/lib/MC/MCParser/AsmParser.cpp | ||
---|---|---|
3011–3012 | No, parens are for grouping (so * applies to multiple items), and square brackets indicate the items are optional. https://en.wikipedia.org/wiki/Backus–Naur_form#Variants describes some of these common BNF-based grammars. In this case I think leaving the parens out of my suggestion is fine, nobody in their right mind would think the + means that it applies to the double quote (i.e. that, say, "string"" is valid), and strictly speaking it should really be '"' string '"' or similar if you want it to be a proper formal grammar. So unless you have objections I think my suggestion is the right thing to go with :) |
llvm/lib/MC/MCParser/AsmParser.cpp | ||
---|---|---|
3011–3012 | Okay I double checked the syntax of .ascii and you are right the string arguments are optional. Will update accordingly. |
llvm/test/MC/AsmParser/directive_ascii.s | ||
---|---|---|
71 | Hmm, MaskRay suggested a test case with , "D" on the end, that's probably a good idea. |
Just want to follow up and see if anyone might have any thoughts on the current implementation of this patch. Otherwise, is it good to submit the patch? @jrtc27 @rnk @nickdesaulniers @MaskRay
llvm/test/MC/AsmParser/directive_ascii.s | ||
---|---|---|
52–56 | I don't think I understand why these get printed with .byte but others with .ascii. I think it'd be important to make these use CHECK-NEXT (and CHECK-LABEL) though to assert that there aren't 0s sneaking in (which this test is poor at, the existing cases are not well-written). |
llvm/test/MC/AsmParser/directive_ascii.s | ||
---|---|---|
52–56 | < I don't think I understand why these get printed with .byte but others with .ascii It seems llvm-mc prints ascii code for single bytes, as shown in other test cases. Or perhaps your question is about the implementation of llvm-mc? < I think it'd be important to make these use CHECK-NEXT (and CHECK-LABEL) though to assert that there aren't 0s sneaking in (which this test is poor at, the existing cases are not well-written). Updated. Thanks. |
llvm/test/MC/AsmParser/directive_ascii.s | ||
---|---|---|
52–56 | Ok that makes sense. In an ideal world this would be .byte 65 ; .ascii "BC" ; .byte 68 then but that's purely cosmetic and not worth wasting time implementing (you'd have to concatenate to Data in the new parse loop and then emit it all at once at the end of the loop). |
Thanks Jian! (I can get a little further assembling the Linux kernel's arch/arm/probes/kprobes/test-arm.o (allyesconfig) with Clang's integrated assembler with this; though that file now has additional failures unrelated to this patch.
Which GDB tests? How are they broken? I'm extremely surprised by the given this patch implements what binutils has done for a very long time (only .asciz/.string have changed behaviour in binutils, and those are not modified by this patch here so should not have regressions).
This seems like a premature revert and probably just causes churn to the LLVM upstream to me...The last revision looks good to me and resolves a GNU as/LLVM integrated assembler difference which we agreed the integrated assembler should improve in that case.
Upstream gdb developers don't build Clang. Many tests may be written in a GCC/binutils dependent way. The issue is not necessarily the integrated assembler issue. It can well be in another place we haven't noticed yet. If indeed that way, internally we can just disable the gdb tests. If there is a Clang 11 vs Clang 12 difference. We could make the upstream gdb tests more amenable to more compiler versions.
The reason is unclear yet and may as well not be caused by this change as @MaskRay pointed out. Will reland or update the patch once we know the reason.
llvm/lib/MC/MCParser/AsmParser.cpp | ||
---|---|---|
3019 | A complete sentence needs a period. |
The test failure turned out to be a false positive. Relanded the change on https://github.com/llvm/llvm-project/commit/9dfeec853008109b1cbe926c22675c96226040d9
This grammar needs updating