This is an archive of the discontinued LLVM Phabricator instance.

[llvm-mc] Add --doc-id=<id> to support multiple documents in a file
AbandonedPublic

Authored by MaskRay on Jul 13 2020, 3:32 PM.

Details

Summary

llvm-mc tests tend to (a) split into multiple files or (b) combine too
much stuff in one file if people don't like splitting.

yaml2doc supports --docnum=<num> to allow multiple documents in a file.
Combing tests prudently can improve readability. This patch adds a similar
--doc-id=<id> to llvm-mc.

Usage: llvm-mc --doc-id=aa %s

#-- aa
test aa
#-- bb
test bb

Diff Detail

Event Timeline

MaskRay created this revision.Jul 13 2020, 3:32 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 13 2020, 3:32 PM

yaml2doc supports --docnum=<num> to allow multiple documents in a file. Combing tests prudently can improve readability. This patch adds a similar --doc-id=<id> to llvm-mc.

Presumably "doc" makes sense in the yaml2doc context - perhaps "obj" would make more sense in llvm-mc

But I'm not sure it's the best direction - we have lots of input files (testing IR, testing C++, etc) where we don't have this ability to have multiple distinct inputs in one file & it seems to work OK? Is there something different about llvm-mc that motivates this feature here especially?

MaskRay added a comment.EditedJul 13 2020, 4:42 PM

yaml2doc supports --docnum=<num> to allow multiple documents in a file. Combing tests prudently can improve readability. This patch adds a similar --doc-id=<id> to llvm-mc.

I am open to an alternative to "doc". llvm-mc supports both assembly and disassembly (both are in textual forms). I'd expect "obj" to refer to a binary form input, which isn't the case for llvm-mc.

Presumably "doc" makes sense in the yaml2doc context - perhaps "obj" would make more sense in llvm-mc

But I'm not sure it's the best direction - we have lots of input files (testing IR, testing C++, etc) where we don't have this ability to have multiple distinct inputs in one file & it seems to work OK? Is there something different about llvm-mc that motivates this feature here especially?

If we want to test local things (how IR is lowered, how instructions are assembled, how bytes are disassembled), whatever the containing function is named does not matter. Say, we have a function foo, and we want to test a variant of it, we can just add another function bar, duplicating the content in foo and adding some variance.

However, this works poorly for some "global resources". For example, if we want to test the content of .strtab, the order of section headers, etc. We can't say: the first 3 input lines test the case of 3 symbols, the last 4 input lines test the case with 4 symbols - .symtab is a singleton, if you want to test 4 symbols, the 3 symbol case is not tested any more.

What we currently do is to add another test file, with some variation to the first test file. It is not rare for me to diff two test files and find the differences. It'd be easier if two similar things are tested in one file, with a comment explaining the differences.

yaml2doc supports --docnum=<num> to allow multiple documents in a file. Combing tests prudently can improve readability. This patch adds a similar --doc-id=<id> to llvm-mc.

I am open to an alternative to "doc". llvm-mc supports both assembly and disassembly (both are in textual forms). I'd expect "obj" to refer to a binary form input, which isn't the case for llvm-mc.

Fair point - 'asm', perhaps? (think that's probably fine/close enough for the disassembly that people aren't going to be too confused by it)

Presumably "doc" makes sense in the yaml2doc context - perhaps "obj" would make more sense in llvm-mc

But I'm not sure it's the best direction - we have lots of input files (testing IR, testing C++, etc) where we don't have this ability to have multiple distinct inputs in one file & it seems to work OK? Is there something different about llvm-mc that motivates this feature here especially?

If we want to test local things (how IR is lowered, how instructions are assembled, how bytes are disassembled), whatever the containing function is named does not matter. Say, we have a function foo, and we want to test a variant of it, we can just add another function bar, duplicating the content in foo and adding some variance.

However, this works poorly for some "global resources". For example, if we want to test the content of .strtab, the order of section headers, etc. We can't say: the first 3 input lines test the case of 3 symbols, the last 4 input lines test the case with 4 symbols - .symtab is a singleton, if you want to test 4 symbols, the 3 symbol case is not tested any more.

What we currently do is to add another test file, with some variation to the first test file. It is not rare for me to diff two test files and find the differences. It'd be easier if two similar things are tested in one file, with a comment explaining the differences.

Fair enough - thanks for the explanation! I'll probably leave it to some other folks who work more directly with llvm-mc than I do to weigh in on the design here, then.

MaskRay added a comment.EditedJul 13 2020, 5:02 PM

To give an example, let's discuss MC/ELF/debug-md5.s and MC/ELF/debug-md5-err.s. There are two things worth noting:

a) the error cases are in a separate file. Once you add an invalid .file, assembling the whole file is invalid. All the RUN lines have to be RUN: not llvm-mc %s. Thus, people tend to place good cases in one file (so that they can test with RUN: llvm-mc %s and bad tests in another file. For some features, having both good and bad tests in one file may improve readability.
b) .debug_line is a global resource. Whenever we add a (valid) .file, we contribute an entry to the global resource. If we want to test some characteristics when include_directories[0] is A, and other characteristics when include_directories[0] is B. OK, we have to use another file.

rsmith added a subscriber: rsmith.Jul 13 2020, 5:42 PM

Would it make sense to split this into a separate utility, so you could use (eg)

# RUN: extract bb %s | llvm-mc - 2>&1 | FileCheck %s --check-prefix=BB

in general, for any tool that can read from stdin? Or even

# RUN: extract bb %s -o %t.bb
# RUN: llvm-mc %t.bb 2>&1 | FileCheck %t.bb

... to consider only the CHECK lines in the extracted region?

Would it make sense to split this into a separate utility, so you could use (eg)

# RUN: extract bb %s | llvm-mc - 2>&1 | FileCheck %s --check-prefix=BB

in general, for any tool that can read from stdin? Or even

# RUN: extract bb %s -o %t.bb
# RUN: llvm-mc %t.bb 2>&1 | FileCheck %t.bb

... to consider only the CHECK lines in the extracted region?

Teach the extract utility about comment markers of common file extensions (.s, .ll, .c, .cpp)? (To make editors happy, the separator should be a comment in that file (e.g. # ; // --- etc))

(I considered an in-utility option first because the syntax is the simplest. extract bb %s -o %t.bb is a bit long but I can accept it)

Would it make sense to split this into a separate utility[...]?

Teach the extract utility about comment markers of common file extensions (.s, .ll, .c, .cpp)? (To make editors happy, the separator should be a comment in that file (e.g. # ; // --- etc))

I was thinking that you could scan the source file for lines containing your separator (eg ---) and assume the comment marker is whatever comes before it. That seems to work well enough for lit (looking for RUN: etc, preceded by anything).

Just chiming in to say +1 to something that achieves the end goal of being able to have multiple assembly files in the same input. A similar motivation I sometimes run into is wanting multiple input files to a tool, e.g. LLD or llvm-readobj, for testing some behaviour. I end up doing one of three things in this situation: 1) adding a separate file in the "Inputs" directory - this is not great because the test input is far away from the actual test (i.e. not in the same file), making it harder to follow; 2) echoing the second and later inputs to separate files at runtime - this is not great because it has a runtime cost; 3) a recent one I've used occasionally is the .if directive to generate different outputs from the same input - this isn't great as it can be somewhat confusing/hard to read at times.

I don't mind which approach is taken (separate tool/addition to llvm-mc), as long as it is simple to use and reasonably efficient.

llvm/tools/llvm-mc/llvm-mc.cpp
53–54

This mentions --- but the code uses --. I'd actually prefer --- probably, but it's minor, so happy to go with a different approach.

I like the aim of this patch and personally I'd slightly prefer to build in this functionality into llvm-mc rather than have an utility,
because it might make test cases shorter and a bit simpler.

One point from me: yaml2obj by default has --docnum=0. I.e. you can either specify the --docnum, or omit and have the same result.
But llvm-mc should not have such behavior. So I just want to ensure that when no --doc-id is requested, then llvm-mc does nothing
for an input. I see it is what this patch implements already, but there is no test for this case?

llvm/tools/llvm-mc/llvm-mc.cpp
317

Consider making this more generic

Expected<std::unique_ptr<MemoryBuffer>> extractDocWithID(const MemoryBuffer& Buf, unsigned ID).

Wouldn't using .ifdef directives with --defsym on the command line work equally well? At least for .s files run through llvm-mc.
There are examples of this in the lit tests already, easy enough to find.

Wouldn't using .ifdef directives with --defsym on the command line work equally well? At least for .s files run through llvm-mc.
There are examples of this in the lit tests already, easy enough to find.

I somehow forgot this. .ifdef ERR + llvm-mc --defsym=ERR=1 seems convenient enough (e.g. D83751) so I am dropping this patch.

Create D83834 for the standalone utility 'extract'.

MaskRay abandoned this revision.Jul 14 2020, 5:34 PM