Page MenuHomePhabricator

[GWP-ASan] Fix flaky test on Fuchsia
ClosedPublic

Authored by cryptoad on Dec 1 2020, 11:46 AM.

Details

Summary

The LateInit test might be reusing some already initialized thread
specific data if run within the main thread. This means that there
is a chance that the current value will not be enough for the 100
iterations, hence the test flaking.

Fix this by making the test run in its own thread.

Diff Detail

Event Timeline

cryptoad created this revision.Dec 1 2020, 11:46 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 1 2020, 11:46 AM
Herald added a subscriber: Restricted Project. · View Herald Transcript
cryptoad requested review of this revision.Dec 1 2020, 11:46 AM

I'm troubled in two ways.

  1. If this test is polluted by global (TLS) state from prior tests, that seems like a non-hermeticity problem for all these tests. Should we maybe be using a test fixture with SetUp/TearDown to wipe the global state between tests?
  2. It seems like the code under test actually has semantics that when uninitialized, it will sometimes sample (just rarely). That seems like a problem to me. If the initialized state as configured by the environment will be to never do a gwp allocation, then there should never be a window where one is randomly allowed to happen.
hctim added a comment.Dec 1 2020, 1:15 PM

I'm troubled in two ways.

  1. If this test is polluted by global (TLS) state from prior tests, that seems like a non-hermeticity problem for all these tests. Should we maybe be using a test fixture with SetUp/TearDown to wipe the global state between tests?

In other environments (both for upstream testing and Android) we run this test in isolation to avoid this problem.

I'm happy with Kostya's solution to spawn a thread, or resetting the TLS in the start of this test (*gwp_asan::getThreadLocals() = ThreadLocalPackedVariables();)

  1. It seems like the code under test actually has semantics that when uninitialized, it will sometimes sample (just rarely). That seems like a problem to me. If the initialized state as configured by the environment will be to never do a gwp allocation, then there should never be a window where one is randomly allowed to happen.

There's an uninitialized check in the slow path. We might enter GPA::allocate() after 1 << 31 allocations, but we never actually provide a GWP-ASan allocation (it just returns nullptr) and we reset the counter to wait another 1 << 31 times before entering the slow path again.

mcgrathr accepted this revision.Dec 1 2020, 1:28 PM

I'm troubled in two ways.

  1. If this test is polluted by global (TLS) state from prior tests, that seems like a non-hermeticity problem for all these tests. Should we maybe be using a test fixture with SetUp/TearDown to wipe the global state between tests?

In other environments (both for upstream testing and Android) we run this test in isolation to avoid this problem.

What kind of isolation? In the upstream tests/CMakeLIsts.txt I don't see anything segregating this test from others. I see that Android build file has a comment and "isolated: true", but I have no idea what that means because I am not familiar with the Android build system or test-running facilities.

I'm happy with Kostya's solution to spawn a thread, or resetting the TLS in the start of this test (*gwp_asan::getThreadLocals() = ThreadLocalPackedVariables();)

The latter seems mildly preferable to me. That is, it makes clear that this particular TLS state is not doing something useful across tests, it just needs to be wiped.

  1. It seems like the code under test actually has semantics that when uninitialized, it will sometimes sample (just rarely). That seems like a problem to me. If the initialized state as configured by the environment will be to never do a gwp allocation, then there should never be a window where one is randomly allowed to happen.

There's an uninitialized check in the slow path. We might enter GPA::allocate() after 1 << 31 allocations, but we never actually provide a GWP-ASan allocation (it just returns nullptr) and we reset the counter to wait another 1 << 31 times before entering the slow path again.

I see. Thanks for the explanation. That eliminates my only concern that wasn't purely about how the test works, and I'm much more sanguine deferring to you on the test scenario.

This revision is now accepted and ready to land.Dec 1 2020, 1:28 PM
cryptoad updated this revision to Diff 308777.Dec 1 2020, 2:39 PM

Updating the CL with another proposed fix: reseting the TLS data in
the uninit function called by the tests.

mcgrathr accepted this revision.Dec 1 2020, 2:46 PM

lgtm

hctim accepted this revision.Dec 1 2020, 3:04 PM

LGTM

This revision was automatically updated to reflect the committed changes.