This is an archive of the discontinued LLVM Phabricator instance.

Implement std::experimental::sample
ClosedPublic

Authored by eugenis on Apr 15 2015, 5:25 PM.

Details

Summary

Implementation of std::experimental::sample from fundamentals v1:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4335.html#alg.random.sample

Diff Detail

Event Timeline

eugenis updated this revision to Diff 23813.Apr 15 2015, 5:25 PM
eugenis retitled this revision from to Implement std::experimental::sample.
eugenis updated this object.
eugenis edited the test plan for this revision. (Show Details)
eugenis added reviewers: mclow.lists, EricWF.
eugenis set the repository for this revision to rL LLVM.
eugenis added a subscriber: Unknown Object (MLST).
EricWF edited edge metadata.Apr 15 2015, 6:36 PM

You'll need to use reserved identifiers and guard the definition of sample inside a C++14 only block because adding a new definition of could break existing code.

include/experimental/algorithm
50

We need a void cast like ++first, (void) ++k to prevent calling a comma operator overload.

You'll need to use reserved identifiers

Do you mean for local variable names? Is there a technical reason for this, or just a code style thing?

and guard the definition of sample inside a C++14 only block because adding a new definition of could break existing code.

n4335 does not seem to require that this is only available in C++14. And aren't names under std:: already reserved, and an addition of a new std:: name could not break compliant code?

You'll need to use reserved identifiers

Do you mean for local variable names? Is there a technical reason for this, or just a code style thing?

For any name not defined in the standard you should use reserved identifiers. Assume that the user of the standard library defines every non-reserved, non-defined name as a macro. We use reserved identifiers so that conforming programs can define non-reserved names globally and still use the standard library.

For example:

#define PopulationIterator char*
#define SampleIterator char*
#define UniformRandomNumberGenerator &rand
#include <experimental/algorithm>

and guard the definition of sample inside a C++14 only block because adding a new definition of could break existing code.

n4335 does not seem to require that this is only available in C++14. And aren't names under std:: already reserved, and an addition of a new std:: name could not break compliant code?

I would prefer not to restrict these things to c++14 only, although there may be a reason that conforming implementations need to that isn't visible in n4335.

However because sample is a function it can be looked up by ADL. Since sample takes user-defined iterator types and RNG's it is possible that existing unqualified calls to sample in user code may now find std::sample.

You'll need to use reserved identifiers

Do you mean for local variable names? Is there a technical reason for this, or just a code style thing?

For any name not defined in the standard you should use reserved identifiers. Assume that the user of the standard library defines every non-reserved, non-defined name as a macro. We use reserved identifiers so that conforming programs can define non-reserved names globally and still use the standard library.

For example:

#define PopulationIterator char*
#define SampleIterator char*
#define UniformRandomNumberGenerator &rand
#include <experimental/algorithm>

Good point, will do.

and guard the definition of sample inside a C++14 only block because adding a new definition of could break existing code.

n4335 does not seem to require that this is only available in C++14. And aren't names under std:: already reserved, and an addition of a new std:: name could not break compliant code?

I would prefer not to restrict these things to c++14 only, although there may be a reason that conforming implementations need to that isn't visible in n4335.

However because sample is a function it can be looked up by ADL. Since sample takes user-defined iterator types and RNG's it is possible that existing unqualified calls to sample in user code may now find std::sample.

That's a valid concern, but limiting this definition to C++14 will still break existing C++14 code. Is it really that much better than breaking C++11 and C++14 at the same time?
Also, that would be std::experimental::sample.

You'll need to use reserved identifiers

Do you mean for local variable names? Is there a technical reason for this, or just a code style thing?

For any name not defined in the standard you should use reserved identifiers. Assume that the user of the standard library defines every non-reserved, non-defined name as a macro. We use reserved identifiers so that conforming programs can define non-reserved names globally and still use the standard library.

For example:

#define PopulationIterator char*
#define SampleIterator char*
#define UniformRandomNumberGenerator &rand
#include <experimental/algorithm>

Good point, will do.

and guard the definition of sample inside a C++14 only block because adding a new definition of could break existing code.

n4335 does not seem to require that this is only available in C++14. And aren't names under std:: already reserved, and an addition of a new std:: name could not break compliant code?

I would prefer not to restrict these things to c++14 only, although there may be a reason that conforming implementations need to that isn't visible in n4335.

However because sample is a function it can be looked up by ADL. Since sample takes user-defined iterator types and RNG's it is possible that existing unqualified calls to sample in user code may now find std::sample.

That's a valid concern, but limiting this definition to C++14 will still break existing C++14 code. Is it really that much better than breaking C++11 and C++14 at the same time?
Also, that would be std::experimental::sample.

This is probably a conversation worth having on a larger scale. I think the rational for keeping in C++14 only is:

  1. N4335 references the C++14 standard as its base document
  2. C++11 should be "stable". Additions added to the language after C++11 not break valid code.

However I think these new features are useful enough that there is a strong motivation to provide them whenever possible. Also, in order for this to break existing code the user has to be including <experimental/*> headers anyway. Furthermore I believe we already allow string_view to be used in C++11.

I'll open up a bug to discuss this issue.

eugenis updated this revision to Diff 24403.Apr 24 2015, 10:53 AM
eugenis edited edge metadata.
eugenis removed rL LLVM as the repository for this revision.

Switched to reserved identifiers.

include/experimental/algorithm
50

done

54

This thing is costing us lots of performance - it's just way too fancy.
It makes this code is about 2x slower than std::random_sample in existing implementations, but I guess it makes the sample more uniform, as well.

Do you thing smth like (g() - g::min()) % ( __k + 1) would be acceptable?

mclow.lists edited edge metadata.Apr 24 2015, 11:21 AM

@Eugenius, you should probably take a look at LWG issue #2463
http://www.open-std.org/jtc1/sc22/wg21/docs/lwg-active.html#2463

I don't think it will require a code change; but you should make sure that we satisfy it.

@Eugenius, you should probably take a look at LWG issue #2463
http://www.open-std.org/jtc1/sc22/wg21/docs/lwg-active.html#2463

I don't think it will require a code change; but you should make sure that we satisfy it.

We do. Thanks for pointing this out.
The code does exactly one pass over [first, last), and each iteration has constant complexity.

However because sample is a function it can be looked up by ADL. Since sample takes user-defined iterator types and RNG's it is possible that existing unqualified calls to sample in user code may now find std::sample.

No, wait, there is no std::sample. It's std::experimental::sample. And the bad ADL could only happen if one of the arguments is in std::experimental, which probably means it's intentional.

../1.cc:9:3: error: no member named 'sample' in namespace 'std'; did you mean 'std::experimental::sample'?

std::sample(a, b, c, d, t);

I'm very confused here.

The LFTS document references N3925 as the source of the sample proposal.

That paper has sample code, and talks about both "reservoir sampling" and "selection sampling", and using one for random access iterators and the other for forward iterators, and includes a reference implementation. This code looks nothing like that.

I also don't see any tests that deal with:

  1. Stability "Stable if and only if PopulationIterator meets the requirements of a ForwardIterator type."
  2. Empty input ranges. (should write nothing)
  3. test for n == 0 (should write nothing)
  4. No tests with different kinds of iterators.

I should have read the entire document. Sorry about that, will fix soon.

Two more comments:

  1. In the libc++ test directory (support), there's a file full of iterator adapters that take a int * (say), and turn it into a forward iterator (for example). Makes writing tests simpler.
  1. Here's another test candidate: PopulationIteartor's value type shall be writeable to out.

That means that you should be able to generate a sample of (say) doubles from a population of ints.

  • SampleIterator shall meet the additional requirements of a RandomAccessIterator type

unless PopulationIterator meets the additional requirements of a ForwardIterator
type.

So if I call sample(InputIterator, InputIterator, ForwardIterator, 1, rnd), I should get a compile error.

eugenis updated this revision to Diff 24652.Apr 29 2015, 1:10 PM
eugenis edited edge metadata.

The specs suggest that we define a feature testing macro, but I could not find a precedent for that in libc++ include/experimental.

The new tests don't test the instability of the reservoir sampling algorithm. Should they?

The new tests don't test the instability of the reservoir sampling algorithm. Should they?

Yes - but I won't hold up accepting this because that test is missing.

The phrase in the standard "stable if and only if"...

test/std/experimental/algorithms/alg.random.sample/sample.stable.pass.cpp
45

Couldn't this just be a call to std::is_sorted(first, last) ?

mclow.lists added inline comments.Apr 30 2015, 8:44 AM
include/experimental/algorithm
86

I think that should forward the RNG: _VSTD::forward<_UniformRandomNumberGenerator>(__g)

eugenis updated this revision to Diff 24727.Apr 30 2015, 8:56 AM

The new tests don't test the instability of the reservoir sampling algorithm. Should they?

Yes - but I won't hold up accepting this because that test is missing.

The phrase in the standard "stable if and only if"...

Fixed anyway, this is easy.

include/experimental/algorithm
86

done

test/std/experimental/algorithms/alg.random.sample/sample.stable.pass.cpp
45

done

However because sample is a function it can be looked up by ADL. Since sample takes user-defined iterator types and RNG's it is possible that existing unqualified calls to sample in user code may now find std::sample.

No, wait, there is no std::sample. It's std::experimental::sample. And the bad ADL could only happen if one of the arguments is in std::experimental, which probably means it's intentional.

../1.cc:9:3: error: no member named 'sample' in namespace 'std'; did you mean 'std::experimental::sample'?

std::sample(a, b, c, d, t);

Urg, That was a stupid mistake on my part. Sorry and thanks for the correction.

include/experimental/algorithm
40

This should probably have the system header pragma found in most other headers.

54

Not sure, but let's do it correctly first then apply the optimization.

eugenis updated this revision to Diff 24743.Apr 30 2015, 10:15 AM
eugenis added inline comments.
include/experimental/algorithm
40

done

More comments. The __sample that takes an input iterator and a random access iterator LGTM.

include/experimental/algorithm
59

_Distance is the wrong type to be using here. It's possible that _Distance is a type that cannot represent the value of __k. The LFTS spec seems misleading about how distance should be used. I'll have to file a standard defect to clarify the wording.

I think __k should be the distance_type of the _PopulationIterator.

Example Code (This has UB and exits with -6 on my machine).

std::vector<int> pop(1024, 42);
    std::vector<int> out(128, 0);
    std::minstd_rand g;
    std::experimental::sample(pop.begin(), pop.end(), out.begin(), (signed char)127, g);
75

Do we need to re-construct the uniform_int_distribution every iteration?

Since it is possible that we will not guard std::experimental::random_sample in a C++14 only block the tests should compile in C++03. Here is my patch to get them working. https://gist.github.com/EricWF/bef9f2a3ca4c86cb0781.

PS. you can pass --param=std=c++03 when invoking LIT to run the tests in C++03. Example:

lit -sv --param=std=c++03 test/std/experimental/algorithm

I'm fairly confident in the selection sampling algorithm as well. However I'm not quite ready to sign off on it. The algorithm as Knuth describes it generates its random numbers slightly differently.

Knuth lists the following steps for selection sampling:

  • Generate a random number U uniformly distributed between zero and one.
  • if (unsampled_size * U) < remaining_inserts then choose that element.

However this implementation of sample does:

  • Generate a random number U uniformly distributed between zero and unsampled_size - 1.
  • if U < remaining_inserts then choose that element.

My concern is that semantic change could introduce a bias. However I'm bad at math and stats so I don't know.
I want to think on it a little more to make sure they are equivalent.

I've also created a patch of various changes and additions related to adding the <experimental/algorithm> header and making the tests run in C++03 (this invalidates the above patch.). Please consider applying this to your patch.
https://gist.github.com/EricWF/c62b0979202cfd2cc6af

include/experimental/algorithm
74

Same note about using _Distance

Correct me if I am wrong, but as I read the standard it requires that if _PopulationIterator meets the requirements of ForwardIterator then we have to use selection sampling (because it is stable). Currently this requirement does not hold. If _PopulationIterator is a forward iterator and _SampleIterator is a random access iterator then reservoir sampling is used.

One way to fix this would be to dispatch to __sample(...) using either true_type or false_type instead of two iterator tags.

Here is an updated patch that fixes most of my concerns: https://gist.github.com/EricWF/c62b0979202cfd2cc6af

I've made the following changes:

  • Fixes issues regarding the type of _Distance.
  • Always select the selection sample algorithm when _PopulationIter is a ForwardIterator.
  • Include <__undef_min_max> because min/max are used.
  • Add _LIBCPP_DEBUG assert when Distance is negative.
  • Get tests compiling in C++03.
  • Adds more tests relating to adding the header <experimental/algorithm>.
  • Update double include header test to include <experimental/algorithm>.
  • Update the documentation regarding the status of LFTS Algorithms.
include/experimental/algorithm
75

Not sure what I was thinking. Obviously we do.

eugenis added inline comments.Apr 30 2015, 6:25 PM
include/experimental/algorithm
75

Well, uniform_int_distribution has an overload of operator() that can change the distribution parameters without re-creating the object, but it does exactly the same amount of work.

eugenis updated this revision to Diff 25229.May 7 2015, 1:53 PM

Added Eric's changes to the diff.

EricWF accepted this revision.May 12 2015, 7:04 PM
EricWF edited edge metadata.

LGTM too.

This revision is now accepted and ready to land.May 12 2015, 7:04 PM
eugenis closed this revision.May 13 2015, 9:59 AM

Thanks!
Committed as r237264.