This is an archive of the discontinued LLVM Phabricator instance.

Fix TestSignedTypes.py by removing a bogus step-over
Abandoned · Public

Authored by amccarth on Jan 19 2016, 4:08 PM.

Details

Reviewers
zturner
Summary

Apparently something changed with thread step-over, causing execution to move outside the stack frame, and thus the local variables were no longer visible.

Since the step-over is unrelated to the purpose of the test, and since the comment on that line was just plain wrong (the breakpoint is already set beyond the last assignment, so deleting the step seems legitimate), I deleted that step.

Tested on Windows with both Python 2.7 and 3.5.

Diff Detail

Event Timeline

amccarth updated this revision to Diff 45314. Jan 19 2016, 4:08 PM
amccarth retitled this revision from to Fix TestSignedTypes.py by removing a bogus step-over.
amccarth updated this object.
amccarth added a reviewer: zturner.
amccarth added a subscriber: lldb-commits.
zturner edited edge metadata. Jan 19 2016, 4:19 PM

I don't think something has changed within lldb though, because just updating lldb without updating llvm and clang doesn't trigger this problem. I think we should try to figure out what really broke.

I agree with Zachary. Just because a test found a bug that wasn't essential to the test doesn't mean we should "fix" the test by silencing the part of the test that uncovered the bug.

This test puts a breakpoint on a 'puts("")' statement and steps over it. All the variables should still be available at that point. So this is a real bug.
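
For context, a test of this shape in the lldb Python test suite looks roughly like the sketch below. The breakpoint regex, source file name, variable name, and expected value are hypothetical placeholders, not the actual contents of TestSignedTypes.py:

    import os
    import lldb
    from lldbsuite.test.lldbtest import TestBase
    import lldbsuite.test.lldbutil as lldbutil

    class SignedTypesSketch(TestBase):

        mydir = TestBase.compute_mydir(__file__)

        def test_signed_locals(self):
            self.build()
            exe = os.path.join(os.getcwd(), "a.out")
            target = self.dbg.CreateTarget(exe)
            self.assertTrue(target.IsValid())

            # Break on the puts("") line; regex and file name are placeholders.
            bkpt = target.BreakpointCreateBySourceRegex(
                r'puts\(""\)', lldb.SBFileSpec("main.cpp"))
            self.assertTrue(bkpt.GetNumLocations() > 0)

            process = target.LaunchSimple(None, None, os.getcwd())
            thread = lldbutil.get_stopped_thread(process, lldb.eStopReasonBreakpoint)
            self.assertIsNotNone(thread)

            # The disputed step-over: the frame's locals should still be
            # in scope afterwards.
            thread.StepOver()

            frame = thread.GetFrameAtIndex(0)
            var = frame.FindVariable("the_signed_char")  # hypothetical variable
            self.assertTrue(var.IsValid())
            self.assertEqual(var.GetValueAsSigned(), -1)  # assumed value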

If we really want to get this particular test working, then it would be okay to add another test that uncovers the same bug but doesn't test the actual values, xfail that, and then delete the step here.

FWIW, thread step-over is *still* broken on Windows (because, well, fixing it is kind of hard). So I'm not surprised that it breaks (although I'm a little surprised it worked in the first place). But at the same time this test broke, about 15 other tests broke as well, so we need to figure out what the underlying cause of the breakage is (and we can always remove this step-over later).

amccarth abandoned this revision. Jan 19 2016, 4:36 PM

Interestingly, the test_step_over_dwarf test is now getting an UNEXPECTED SUCCESS on Windows.

The TestUnsignedTypes.py test (from which this one was apparently copied--see the class name) doesn't have the step-over.

FWIW, I think Adrian's original point is that testing the behavior of signed types shouldn't depend on step-over functionality. It's good practice in general to make tests depend on as little debugger functionality as possible to reliably test the thing you want to test, because the more functionality you depend on, the more fickle your test becomes. Why does a bug in one platform's implementation of step-over break a test about whether signed ints work?

So, I'm all for removing this test's dependency on step-over (TestUnsignedTypes doesn't use step over, for example) if there's a way to reliably test the functionality without step over.

But I still think it's important to know what CL broke all these tests.

In general I agree with your concept of trying to make the tests standalone without depending on a lot of other functionality, but I see a major issue. Currently our test coverage is low even for the basic functionality (backtrace, frame variables, stepping, etc.), and a lot of failures in these areas are detected by completely unrelated tests, because those tests depend on that functionality and as a result do a sort of stress testing (each test tries a slightly different situation). If we make all of our tests standalone without increasing the number of tests by a lot (I guess 2-5x is needed), I expect that we will lose a lot of coverage. I am happier if an unrelated test fails because we introduced a bug than if we end up with a lot more undetected bugs.

I sort of agree with this and sort of don't. Formally, I agree with the notion of limited focused tests. But in practice it is often the noise in tests that catches bugs that we don't yet have tests for. And especially when the "noise" is doing things like step over that 100% should work in any functional debugger... So I am also a little leery of cleaning up the tests too much so that they only test the things we've thought to test and miss other things.

Jim

I think we should solve that by adding more tests though. I mean I don't want to do a 1 step forward, 2 steps back kind of thing. If someone knows of an area where we're missing coverage, then stop what you're doing and go add the coverage. If it's the type of thing where we don't know we're missing coverage and we're just hoping an unrelated test catches it someday, I don't agree with that. Just because we know we have a problem doesn't mean we should solve it the wrong way, because then we can never possibly reach the desired end state since we're actively moving farther away from it.

For example, what if someone adds a test that uses a very small amount of functionality but tests the thing it's supposed to test? Do we block the change and say "please make this test more complicated, we want to test as much stuff as possible"? Of course not. It's ridiculous, right?

It sounds like at least 2 people on this thread know of specific areas where we're lacking test coverage. Why not spend a few days doing nothing but improving test coverage?

I think that to get very good coverage for the basic functionality we need some sort of stress testing or fuzzing infrastructure, because most issues I hit come from compilers generating something a bit different than we are expecting. For stack unwinding I added a stress test (TestStandardUnwind.py, working on arm and aarch64) which catches a lot of bugs, but running it takes so long that it can't be added to the everyday test suite. Should we try to design a fuzz testing infrastructure that runs on a buildbot, so we don't have to depend on the incidental fuzz-testing behavior of the normal test suite?

By having tests do too much, we get problems like this (where ~12 seemingly unrelated tests started failing over the holiday weekend) instead of one test that points to the root cause. (In this case, we're still looking for the root cause of the regression. A simple bisect doesn't help because we have separate repos for LLVM, clang, and lldb.)

I don't know, I still disagree. If something in step-over breaks, I don't want to dig through a list of 30 other tests that have nothing to do with the problem, only to find out 2 days later that the problem is actually in step-over. The only reason this helps is because the test suite is insufficient as it is. But it doesn't need to be!

The real solution is for people to start thinking about tests more. I've hounded on this time and time again, but it seems most of the time tests only get added when I catch a CL go by with no tests and request them. Sometimes they don't even get added then. "Oh yeah, this is on my radar, I'll loop back around to it." <Months go by, no tests>. Hundreds of CLs have gone in over the past few months, and probably 10 tests have gone in. *That's* the problem. People should be spending as much time thinking about how to write tests as they are about how to write the thing they're implementing. Almost every CL can be tested. Everything, no matter how small, can be tested. If the SB tests are too heavyweight, that's what the unit tests are for. If there's no SB API that does what you need to do to test it, add the SB API. "But I have to design the API first" is not an excuse. Design it, then.

We've got an entire class of features that "can't be tested" (the unwinder). There are like 0 unwinding tests. I get that it's hard, but writing a debugger is hard too, and you guys did it. I do not believe that we can't write better tests. Or unwinder tests. Or core-file debugging tests.

Really, the solution is for people to stop checking in CLs with no tests, and for people to spend as much time writing their tests as they do the rest of their CLs. If the problem is that people don't have the time because they've got too much other stuff on their plate, that's not a good excuse, and I don't think we should intentionally encourage writing poor tests just because someone's manager doesn't give them enough time to do things the right way.

Perhaps a middle ground to these two sides could be something along the lines of "If you're going to make sweeping changes to remove a particular feature from a set of tests, make sure there's a reasonable amount of isolated coverage of the thing you're removing".

Honestly though, our culture of testing really needs to improve at the larger scale going forward. We need more tests, and we need to start blocking or reverting CLs that don't have some amount of test coverage. A common one that I see is "well this is just putting some infrastructure in place that isn't being used anywhere yet". But even that is still testable. That's exactly what unit tests, mock implementations, and dependency injection are for.

I don't know, I still disagree. If something in step-over breaks, I don't want to dig through a list of 30 other tests that have nothing to do with the problem, only to find out 2 days later that the problem is actually in step-over. The only reason this helps is because the test suite is insufficient as it is. But it doesn't need to be!

I agree but first we should fix the test coverage and then fix the individual tests. Doing it in the opposite way will cause a significant drop in quality (we will fix individual tests but not increase the coverage enough).

The real solution is for people to start thinking about tests more. I've hounded on this time and time again, but it seems most of the time tests only get added when I catch a CL go by with no tests and request them. Sometimes they don't even get added then. "Oh yeah, this is on my radar, I'll loop back around to it." <Months go by, no tests>. Hundreds of CLs have gone in over the past few months, and probably 10 tests have gone in. *That's* the problem. People should be spending as much time thinking about how to write tests as they are about how to write the thing they're implementing. Almost every CL can be tested. Everything, no matter how small, can be tested. If the SB tests are too heavyweight, that's what the unit tests are for. If there's no SB API that does what you need to do to test it, add the SB API. "But I have to design the API first" is not an excuse. Design it, then.

I think we need a different API for tests than the SB API, one which can be changed more freely without having to worry about backward compatibility. When adding a new feature I try to avoid adding an SB API until I know for sure what data I have to expose, because a wrong decision early on will carry forward (how many deprecated SB API calls do we have?).

We've got an entire class of features that "can't be tested" (the unwinder). There are like 0 unwinding tests. I get that it's hard, but writing a debugger is hard too, and you guys did it. I do not believe that we can't write better tests. Or unwinder tests. Or core-file debugging tests.

Really, the solution is for people to stop checking in CLs with no tests, and for people to spend as much time writing their tests as they do the rest of their CLs. If the problem is that people don't have the time because they've got too much other stuff on their plate, that's not a good excuse, and I don't think we should intentionally encourage writing poor tests just because someone's manager doesn't give them enough time to do things the right way.

It is true that every CL can be tested, but a lot of changes go in to address a specific edge case generated by a specific compiler in a strange situation. To create a reliable test from that, we would have to check in a compiled binary with the strange/incorrect debug info, and then it becomes a platform- and architecture-specific test that is also very hard to debug, because you most likely can't recompile it with your own compiler. I am not sure we want to go down this road.

I think we need a different API for tests than the SB API, one which can be changed more freely without having to worry about backward compatibility. When adding a new feature I try to avoid adding an SB API until I know for sure what data I have to expose, because a wrong decision early on will carry forward (how many deprecated SB API calls do we have?).

Do you have a concrete example of where you don't want to add an SB API, but a unit test isn't ideal?

It is true that every CL can be tested, but a lot of changes go in to address a specific edge case generated by a specific compiler in a strange situation. To create a reliable test from that, we would have to check in a compiled binary with the strange/incorrect debug info, and then it becomes a platform- and architecture-specific test that is also very hard to debug, because you most likely can't recompile it with your own compiler. I am not sure we want to go down this road.

You can test cases like this easily with unit tests and dependency injection. Just make a mock interface that returns the exact values you need when queried for certain symbols or whatever.
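
To illustrate the technique (lldb's actual unit tests are C++/gtest; this Python sketch just shows the dependency-injection idea, and every name and value in it is hypothetical):

    import unittest
    from unittest import mock

    def locals_visible_after_step(unwind_source, pc):
        # Hypothetical logic under test: a frame's locals are considered
        # visible only while the pc stays inside the range the injected
        # symbol provider reports for that frame.
        frame_range = unwind_source.frame_range_for_pc(pc)
        if frame_range is None:
            return False
        start, end = frame_range
        return start <= pc < end

    class LocalsVisibilityTest(unittest.TestCase):
        def test_pc_just_past_reported_frame_range(self):
            # Inject a mock that reproduces the "strange debug info" case
            # directly, instead of checking in a compiler-specific binary.
            source = mock.Mock()
            source.frame_range_for_pc.return_value = (0x1000, 0x1010)
            self.assertFalse(locals_visible_after_step(source, 0x1010))
            source.frame_range_for_pc.assert_called_once_with(0x1010)

    if __name__ == "__main__":
        unittest.main()

The point is only that the awkward input comes from the injected mock rather than from a checked-in, compiler-specific binary.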

Do you have a concrete example of where you don't want to add an SB API, but a unit test isn't ideal?

Recently we added several language-specific commands for RenderScript and for Go. I don't think we want to add SB API support for these at the moment, because nobody uses them, so maintaining it could be problematic.

It is true that every CL can be tested, but a lot of changes go in to address a specific edge case generated by a specific compiler in a strange situation. To create a reliable test from that, we would have to check in a compiled binary with the strange/incorrect debug info, and then it becomes a platform- and architecture-specific test that is also very hard to debug, because you most likely can't recompile it with your own compiler. I am not sure we want to go down this road.

You can test cases like this easily with unit tests and dependency injection. Just make a mock interface that returns the exact values you need when queried for certain symbols or whatever.

To do this we would have to mock a lot of things, including the full Process and Thread classes, which I am not too keen on doing because I expect that a completely unrelated change will break the mock.

Personally I am happier debugging a complicated test than maintaining a huge number of tests, but I know that a lot of people have different preferences. Also, I prefer using stress tests and fuzzing instead of trying to create a specific test for each edge case we can think of, because it keeps the number of tests low and relatively easy to maintain while still having a high probability of detecting issues.

Sure, an interface change to Process might break the mock, but it would break at compile time, and you just fix it up. It's not something that would happen frequently; this is the same situation as in LLVM, where there are unit tests, sometimes they break, and people fix them. But it's never been an issue because it doesn't happen often.

I know there are different opinions about what type of tests to write and how reduced they should be, but in this case I'm simply following the LLVM guidelines for writing test cases, which we're supposed to adhere to: http://llvm.org/docs/DeveloperPolicy.html#test-cases (except for obvious differences, such as how it uses lit, etc.)

Right, but I don't agree in this case that "different" has to mean "we discourage the use of reduced test cases". I have a hard time imagining a scenario where not having reduced test cases is an advantage. It's also easy to explain to people: "Write reduced test cases". It's easy to understand. More so than "Reduced test cases are good, but not always, because of fuzziness, but we can't really put into words exactly what amount of fuzziness we want, or what it should look like, so I guess anything goes; don't ask, don't tell".

If we want fuzziness, we should do fuzz testing. Feature tests and regression tests should be reduced. The whole reason the tests in lldb/tests were written is because people were testing a specific feature or bugfix. I don't see a reason to add fuzziness to this kind of test. All it does is pollute failure logs. Fuzziness is better tested by fuzz tests.

I'm not sure this is a terribly productive discussion.

Since I know that the debugger is stateful, when I write a test where I get to point A and do thing X, I will often add - while I'm there - "step again and see if it still works" or something morally equivalent to that. I have found that to be a method of test writing for debuggers that very often catches bugs. The fact that this test will then break if "step" breaks doesn't bother me, because this might be the only example where step breaks in this particular way, so that was actually a plus, and if something basic like step breaks we're going to fix it right away anyway, because it is step not working...

I am also not against writing more focused tests when that is appropriate. But I am also pretty sure formal considerations are unlikely to outweigh this pretty consistent experience of writing tests for debuggers.

Jim
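
As a concrete illustration of the "step again and see if it still works" pattern described above, a sketch in the lldb Python test-suite style might look like this (the breakpoint marker, source file name, variable name, and expected value are hypothetical):

    import lldb
    from lldbsuite.test.lldbtest import TestBase
    import lldbsuite.test.lldbutil as lldbutil

    class StepAndRecheckSketch(TestBase):

        mydir = TestBase.compute_mydir(__file__)

        def test_locals_survive_a_step(self):
            self.build()
            # Run to a source-level breakpoint using the lldbutil helper.
            lldbutil.run_to_source_breakpoint(
                self, "// break here", lldb.SBFileSpec("main.c"))

            # Check the state once...
            self.expect("frame variable --show-types the_signed_char",
                        substrs=["(signed char)", "-1"])

            # ...then step again and check that the same state is still there.
            self.runCmd("thread step-over")
            self.expect("frame variable --show-types the_signed_char",
                        substrs=["(signed char)", "-1"])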

So, I don't actually disagree with the idea of doing a "step and see if it still works", because I think that, by itself, is useful as an additional check. I'm not against doing extra work in a test when it adds something concretely useful to the test. So I still think we should reduce all test cases as much as possible, but as long as we can justify each operation as adding additional robustness to that particular test, I'm fine with it.

That said, I don't think we should discourage finding better ways to test something either. If someone comes up with a different approach to writing a test for a particular feature, and that approach relies on less debugger functionality but still tests the feature in question, I think it should be encouraged. More granular tests give more clues as to the nature of the underlying problem, produce fewer false positives on failures, and make the test suite run faster.