This implements the idea from the llvm-dev RFC, intending to combine the advantages from FileCheck regression tests and gtest unittests.
Tests are organized like lit RUN scripts as files in the test directory. Their compilation is organized by cmake, wither within the LLVM build configuration itself or as a subproject, configurable using the LLVM_EXTERNAL_COMPILED_TESTS configuration variable.
As an example, I converted the vectorizeVFone.ll test to a compiled test. It's sole reliance of CHECK-NOT with its risk of a false pass has already been discussed in D79124 and D78054. The compiled test has more asserts than necessary,
Missing in this Differential
- The library compilertestboilerplate is basically a copy of the opt program which makes the transition of existing test easier, but I don't intent to keep it that way. Either the opt sources would be used directly or dedicated methods created.
- gtest will can launch tests in-process as well as subprocesses. When launching run_opt multiple times in the same process, cl::opt must be reset for which there currently no API.
- I did not yet implement ASSERT_ALL_OF from the RFC, which should make the assert lines shorter (and print more useful messages when failing).
- TESTS() with the same TestSuiteName.TestName will clash when linked together. The TestSuiteName could be derived from the file name, and maybe compiledtestboilerplate contain a fixture class that reads the IR itself without a call to run_opt.
- I found that linking all the tests of a single directory into the same gtest executable is a good trade-off between link overhead and artifact dependency. When using BUILD_SHARED_LIBS or LLVM_BUILD_LLVM_DYLIB one could also have separate executable for each test, unfortunately these are not supported on Windows, so we cannot require them.
What's wrong with FileCheck?
Nothing, for the right uses cases. However, FileCheck seem to be the hammer that makes everything look like nails. clang -verify is a domain-specific checker for errors/warnings that makes the experience for better than FileCheck would. Here are features that I think could be improved (subjective!!!):
- Complexity of regex matching: -DAG, -NOT, -NEXT, -SAME, -LABEL, string- and numeric variables. Often counter-intuitive default behavior (CHECK: br matches subregion ie, word boundaries ignored by default, [[V:.*]] matching more than expected, start- and end of line ignored, ...). Gotchas such as mentioned using a named regex a second time silently overwriting the previous variable, but not checking it and others mentioned on the mailing list.
- Often hard to understand what an existing test is supposed to check and which parts are relevant. A simple "CHECK: br" to check a branch instructions leads to false positives, so more than the direct concern is checked. Often, the concern is only checked indirectly, such as branch instructions when only the control flow matters. Various tricks are used, such as
CHECK: X CHECK: X CHECK-NOT: X"
to check for exactly 2 occurrences of X; or CHECK: x double> to verify the existence of vector code.
- Analyzing what is supposed to happen in a failed test is pain. A CHECK of a missing instruction more often than not matches a different instruction, and only later CHECKs using regex variables will fail, requiring understanding what it could match in often huge outputs. I often run the test before without my change with --dump-input to see what something is supposed to match. It becomes even worse when too much is tested into a single file, multiple functions, etc. In theory, CHECK-LABEL can help here, but it is not used consistently in all tests with multiple functions, and --dump-input still dumps the entire input instead of just the wrong sections. It also doesn't help for e.g. attribute and metadata at the end of a file. Since metadata is also uniqued, a single metadata node is relevant for multiple functions, without an indication which should belong to which.
- Many tests verify LLVM_DEBUG(dbgs() << ...) output, such that non-assert build skip many tests. It also changes the meaning of llvm::dbgs() which presumable at some point was meant as "printf-debugging" aid, now is part of the correctness test such that selectively remove them anymore to reduce output verbosity or add new ones without risking of breaking regression tests. Tests often also mix stdout and stderr which has undefined ordering (stdout is buffered at multiple levels, stderr usually is not).
- Sensitive to unrelated changes, such as ordering of basic blocks, additional instruction changing value names. Adding additional information to an analysis printer such as -analyze -scalar-evolution breaks existing tests. In particular CHECK-NOT tests are prone to accidentally pass, e.g. when no output is produced at all or the output format has changed. Typically, when authoring a change that break unrelated tests, only the regression tests that fail are fixed.
- For more robust tests, there is the possibility to use regex variables for value names, use CHECK-DAG etc., which makes writing a good test very time-consuming. The only reliable way to check whether a newly written test test what it is supposed to match is to run it before and after the change it is supposed to test.
- Because of all of the above, maintenance of all the regression tests is a nightmare. I expect it to be a serious issue for introducing opaque pointers. My prediction is that we will have a typed-pointer command line flag to not have to require updating all the write-only regression tests.
- There is a trend to instead just generate the CHECK lines using update_test_checks.py. Its doc mentions that one should keep only the relevant CHECK lines, but in practice usually check of the entire output is committed, worsening the problem of understanding what is being checked and adding noise when updating regression tests. However, if this workflow is generally applied, we could automatically re-generate ALL tests reducing the burden of test maintanance.
What's wrong with unittests?
Nothing, I wish unittests on LLVM-IR would be more common. Here is why I think why there are relatively few unittests on IR.
- Significant boilerplate to add a new unittest: CMakeLists.txt, code to create a pass pipeline, etc. Editing the CMakeLists.txt will also require a re-run of cmake.
- As a consequence, these unittest files tend to be quite big (e.g. > 3000 lines for MetadataTest.cpp)
- IR as strings in C++. Raw strings introduced in C++11 make these a bit more practicable, but indention of the nested IR is not always consistent.
- The unittests are in a different subfolder, making them seem unrelated to the regression tests
Which is exactly what this RFC intends to solve.