This patch improves the assembly output produced for string literals by using character literals in byte lists. This provides the benefits of having printable characters appear as such in the assembly output and of having strings kept as logical units on the same line.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
I'm sure this may have been considered, but given that .byte also supports a limited string syntax (given with the limitation that it can support ), is there a reason we can't just glue consecutive runs of printable characters together?
e.g. .byte 'h,'e,'l,'l,'o,' ,'w,'o,'r,'l,'d,'!,0012,0000 becomes:
.byte 'hello world!' .byte 0012,0000
(Essentially mode shift between writing printable as a string and the comma seperated octal form, and just emitting a new .byte directive when we encounter the other type)
Yes, the patch description says consideration was given towards "having strings kept as logical units on the same line". This is beneficial for the purposes of being able to grep even when sequences mix "printable" and "non-printable" characters. For example, the UTF-8 code unit sequence for e with acute would likely end up on a line of its own if the alternative you mention is used. For a program with strings in French, I imagine that could be rather annoying.
| llvm/lib/MC/MCAsmStreamer.cpp | ||
|---|---|---|
| 1052–1053 | Why not ward off trouble below and just make the check reflect whether AsciizDirective is actually usable in this case upfront? Data.size() == 1 || !(MAI->getAsciiDirective() || MAI->getByteListDirective() || MAI->getAsciizDirective() && Data.back() == 0) | |
| 1080 | See above comment. | |
- Address comment: Avoid situation where assert arises; reduce duplication
| llvm/lib/MC/MCAsmStreamer.cpp | ||
|---|---|---|
| 1052–1053 | I've refactored to remove the duplication of the conditions. | |
| llvm/lib/MC/MCAsmStreamer.cpp | ||
|---|---|---|
| 986 | May need to hedge against -Werror,-Wformat-security warning as seen recently on another commit. | |
| llvm/lib/MC/MCAsmStreamer.cpp | ||
|---|---|---|
| 986 | I'm assuming specifically we're worried about -Wformat-nonliteral. I think the MAI property supporting other arbitrary character formatting strings is a nice to have, but not essential to this patch, since AIX is currently the only user. We could simplify to the just supporting a character prefix (single quote in our case) or something like that so we can use a fixed format for the character itself. | |
- Update to avoid stored format string by using switch on a enum value
| llvm/lib/MC/MCAsmStreamer.cpp | ||
|---|---|---|
| 986 | @daltenty, the patch has been updated to encode the known (and used) formats into an enumeration type. The implementation of the formatting itself is selected via a switch, which passes different lambdas depending on the case chosen. | |
LGTM, thanks!
| llvm/lib/MC/MCAsmStreamer.cpp | ||
|---|---|---|
| 983 | nit: the name of this lambda isn't very descriptive, but I don't really have a better suggestion | |
nit: the name of this lambda isn't very descriptive, but I don't really have a better suggestion