We have been working on Dexter for the past few weeks, looking for ways to improve it. One of the ideas we had was to add a tool that can automatically generate Dexter tests; this is the result of that work.
This is a working implementation of a test generation tool. Here's a list of the currently supported features:
- Seamless integration with Dexter: the tool can simply be invoked like any other tool, using python3 dexter.py gen //...
- LLDB debugger support (note: this is currently the only debugger supported)
- Supports generation of DexExpectWatchValue and DexExpectStepKind commands
- Generation of DexExpectStepKind commands can be disabled using --no-expect-steps
- Generation of Dexter commands for each visible function argument, local variable or global variable in the top stack frame of each debugger step
- By default, it generates commands for everything, but you can also restrict the set of commands that'll be generated by using --expect-values-of (ALL | ARG | LOCAL | GLOBAL) // default is ALL
- e.g. if you use --expect-values-of ARG, the tool will only generate DexExpectWatchValue commands for function arguments visible in the top stack frame of each debugger step (a rough sketch of this filtering is shown after this list)
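To make the filtering behaviour concrete, here is a minimal sketch of the idea in Python. None of these names come from the actual implementation; ExpectValuesOf, filter_watches, and the (name, kind) representation of a frame's variables are invented for this example:

```python
from enum import Enum

class ExpectValuesOf(Enum):
    """Mirrors the --expect-values-of choices (hypothetical helper)."""
    ALL = "ALL"
    ARG = "ARG"
    LOCAL = "LOCAL"
    GLOBAL = "GLOBAL"

def filter_watches(frame_variables, mode):
    """Keep only the variables we would emit DexExpectWatchValue commands for.

    frame_variables is assumed to be a list of (name, kind) pairs, where kind
    is 'ARG', 'LOCAL' or 'GLOBAL' -- a simplification of whatever the real
    per-step debugger data looks like.
    """
    if mode is ExpectValuesOf.ALL:
        return list(frame_variables)
    return [(name, kind) for name, kind in frame_variables
            if kind == mode.value]

# e.g. --expect-values-of ARG keeps only function arguments:
print(filter_watches([("x", "LOCAL"), ("val", "ARG")], ExpectValuesOf.ARG))
# -> [('val', 'ARG')]
```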
Here is an example of the output from this tool:
```c++
void set(int &dest, int val) {
  for(int k = 0; k < 5; ++k)
    dest += val;
}

int main() {
  int x = 0;

  set(x, 5);
  return x;
}

//===--- AUTO-GENERATED DEXTER COMMANDS ---===//
//DexExpectWatchValue('x', '0', on_line=9, require_in_order=False)
//DexExpectWatchValue('dest', '0', '5', '10', '15', '20', '25', on_line=2, require_in_order=False)
//DexExpectWatchValue('val', '5', '5', '5', '5', '5', '5', on_line=2, require_in_order=False)
//DexExpectWatchValue('dest', '0', '5', '10', '15', '20', on_line=3, require_in_order=False)
//DexExpectWatchValue('val', '5', '5', '5', '5', '5', on_line=3, require_in_order=False)
//DexExpectWatchValue('k', '0', '1', '2', '3', '4', on_line=3, require_in_order=False)
//DexExpectWatchValue('dest', '25', on_line=4, require_in_order=False)
//DexExpectWatchValue('val', '5', on_line=4, require_in_order=False)
//DexExpectWatchValue('x', '25', on_line=10, require_in_order=False)
//DexExpectWatchValue('x', '25', on_line=11, require_in_order=False)
//DexExpectStepKind('FUNC', 2)
//DexExpectStepKind('VERTICAL_FORWARD', 9)
//DexExpectStepKind('VERTICAL_BACKWARD', 5)
//===--------------------------------------===//
```
Of course, this implementation is still at the "prototype" stage. Some things are not perfect and it still needs some work before being ready for upstream:
- For instance, the tool always generates commands using require_in_order=False, as there is a bug that prevents us from generating the values in the correct order.
- The current implementation derives from TestToolBase, which is probably not the best approach.
- The current implementation also collects the set of visible variables for every stack frame of every step, regardless of whether we are running in test or gen mode. This is not ideal performance-wise (a possible way to avoid that cost is sketched below).
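One possible direction for that last point, shown as a minimal sketch only: skip the expensive variable collection entirely when the tool is not generating commands. The names here (step.frames, visible_variables()) are hypothetical stand-ins, not Dexter's real internals:

```python
def collect_step_watches(step, generating_commands):
    """Gather (name, value) pairs for variables visible in a step's top frame.

    `step`, `frames` and `visible_variables()` are placeholders for whatever
    the real debugger-step data structure looks like; the point is only that
    the walk over the frame is skipped in plain test mode.
    """
    if not generating_commands:
        return []
    top_frame = step.frames[0]
    return [(var.name, var.value) for var in top_frame.visible_variables()]
```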
That said, this implementation is good enough to run some tests, so we ran some experiments using this tool, driven by a bash script; here are some graphs!
(We generated tests using a given configuration, and ran them against GCC/Clang at every optimization level; a rough sketch of the driver script appears after the graphs below.)
- Test generated using GCC -Og:
- Test generated using GCC -O2:
- Test generated using Clang -O0 that only checks the values of function arguments and does not expect debugger steps:
Here is the .tar containing every graph (70 total). Keep in mind that those were generated relatively early in the development of this tool so the actual results could be slightly different (but still within 10% I believe):
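For reference, the experiments were driven by something along these lines, rewritten here as a Python sketch rather than the actual bash script. The --builder/--debugger/--cflags options follow Dexter's usual test interface, but the builder names, the set of optimization levels, and the way the score is read from the output are assumptions:

```python
import subprocess

OPT_LEVELS = ["-O0", "-Og", "-O1", "-O2", "-O3"]

def run_experiment(test_dir, gen_builder="clang", gen_cflags="-O0 -g"):
    # Generate the annotated tests once, under a single reference configuration.
    subprocess.run(["python3", "dexter.py", "gen", "--builder", gen_builder,
                    "--debugger", "lldb", "--cflags", gen_cflags, test_dir],
                   check=True)
    # Then score the same generated tests under every compiler/optimization pair.
    for builder in ("clang", "gcc"):
        for opt in OPT_LEVELS:
            proc = subprocess.run(
                ["python3", "dexter.py", "test", "--builder", builder,
                 "--debugger", "lldb", "--cflags", f"{opt} -g", test_dir],
                capture_output=True, text=True)
            # Assumption: Dexter prints the score on the last line of its output.
            last = (proc.stdout.strip().splitlines() or ["<no output>"])[-1]
            print(f"{builder} {opt}: {last}")
```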
And here's what we found (our conclusions):
- This tool is great at generating very precise tests, but those tests can be too precise, which can negatively affect the score under some configurations (e.g. a test generated using GCC -Og will not get a perfect score under Clang -O0 due to some GCC-isms).
- We found that Dexter isn't good at detecting cases where the debug experience is actually better than what the test expects. The graph above (test generated using GCC -O2) is a perfect example of this.
- In short, tests should always be generated under the compiler configuration that provides the best debug experience (e.g. GCC -Og or Clang -O0), otherwise the results can't be trusted.
- The results of tests generated using this tool under one compiler (e.g. GCC) shouldn't be trusted if run under a different compiler (e.g. Clang).
- This is due to differences in how Clang and GCC generate debug info. For instance, GCC generates an extra step for the closing } of a function while Clang does not. In short, every compiler is different, and tests generated using this tool will inevitably be biased towards the compiler used to generate them.
We'd really like to get some feedback before investing more work into this tool. So the question is: what do you think? Is there upstream interest in this?
I understand this is at the prototype stage, but... 😉
This extra else step here, I believe, is to set up an empty dict that can be filled with commands generated by the test run, for the annotated test file generation (correct me if I'm wrong).
I'm not completely against having an else statement here but I feel there should at least be a comment describing why we'd want differing behaviour.
Also, whilst I'm not a huge fan of the 'big-ball-of-context' approach taken in the past, we could reduce the parameter list to get_debugger_steps by interrogating the context object.