Implementation of std::experimental::sample from fundamentals v1:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4335.html#alg.random.sample
Details
- Reviewers
mclow.lists EricWF
Diff Detail
Event Timeline
You'll need to use reserved identifiers and guard the definition of sample inside a C++14 only block because adding a new definition of could break existing code.
include/experimental/algorithm | ||
---|---|---|
50 | We need a void cast like ++first, (void) ++k to prevent calling a comma operator overload. |
Do you mean for local variable names? Is there a technical reason for this, or just a code style thing?
and guard the definition of sample inside a C++14 only block because adding a new definition of could break existing code.
n4335 does not seem to require that this is only available in C++14. And aren't names under std:: already reserved, and an addition of a new std:: name could not break compliant code?
For any name not defined in the standard you should use reserved identifiers. Assume that the user of the standard library defines every non-reserved, non-defined name as a macro. We use reserved identifiers so that conforming programs can define non-reserved names globally and still use the standard library.
For example:
#define PopulationIterator char* #define SampleIterator char* #define UniformRandomNumberGenerator &rand #include <experimental/algorithm>
and guard the definition of sample inside a C++14 only block because adding a new definition of could break existing code.
n4335 does not seem to require that this is only available in C++14. And aren't names under std:: already reserved, and an addition of a new std:: name could not break compliant code?
I would prefer not to restrict these things to c++14 only, although there may be a reason that conforming implementations need to that isn't visible in n4335.
However because sample is a function it can be looked up by ADL. Since sample takes user-defined iterator types and RNG's it is possible that existing unqualified calls to sample in user code may now find std::sample.
Good point, will do.
and guard the definition of sample inside a C++14 only block because adding a new definition of could break existing code.
n4335 does not seem to require that this is only available in C++14. And aren't names under std:: already reserved, and an addition of a new std:: name could not break compliant code?
I would prefer not to restrict these things to c++14 only, although there may be a reason that conforming implementations need to that isn't visible in n4335.
However because sample is a function it can be looked up by ADL. Since sample takes user-defined iterator types and RNG's it is possible that existing unqualified calls to sample in user code may now find std::sample.
That's a valid concern, but limiting this definition to C++14 will still break existing C++14 code. Is it really that much better than breaking C++11 and C++14 at the same time?
Also, that would be std::experimental::sample.
This is probably a conversation worth having on a larger scale. I think the rational for keeping in C++14 only is:
- N4335 references the C++14 standard as its base document
- C++11 should be "stable". Additions added to the language after C++11 not break valid code.
However I think these new features are useful enough that there is a strong motivation to provide them whenever possible. Also, in order for this to break existing code the user has to be including <experimental/*> headers anyway. Furthermore I believe we already allow string_view to be used in C++11.
I'll open up a bug to discuss this issue.
Switched to reserved identifiers.
include/experimental/algorithm | ||
---|---|---|
50 | done | |
54 | This thing is costing us lots of performance - it's just way too fancy. Do you thing smth like (g() - g::min()) % ( __k + 1) would be acceptable? |
@Eugenius, you should probably take a look at LWG issue #2463
http://www.open-std.org/jtc1/sc22/wg21/docs/lwg-active.html#2463
I don't think it will require a code change; but you should make sure that we satisfy it.
We do. Thanks for pointing this out.
The code does exactly one pass over [first, last), and each iteration has constant complexity.
However because sample is a function it can be looked up by ADL. Since sample takes user-defined iterator types and RNG's it is possible that existing unqualified calls to sample in user code may now find std::sample.
No, wait, there is no std::sample. It's std::experimental::sample. And the bad ADL could only happen if one of the arguments is in std::experimental, which probably means it's intentional.
../1.cc:9:3: error: no member named 'sample' in namespace 'std'; did you mean 'std::experimental::sample'?
std::sample(a, b, c, d, t);
I'm very confused here.
The LFTS document references N3925 as the source of the sample proposal.
That paper has sample code, and talks about both "reservoir sampling" and "selection sampling", and using one for random access iterators and the other for forward iterators, and includes a reference implementation. This code looks nothing like that.
I also don't see any tests that deal with:
- Stability "Stable if and only if PopulationIterator meets the requirements of a ForwardIterator type."
- Empty input ranges. (should write nothing)
- test for n == 0 (should write nothing)
- No tests with different kinds of iterators.
Two more comments:
- In the libc++ test directory (support), there's a file full of iterator adapters that take a int * (say), and turn it into a forward iterator (for example). Makes writing tests simpler.
- Here's another test candidate: PopulationIteartor's value type shall be writeable to out.
That means that you should be able to generate a sample of (say) doubles from a population of ints.
- SampleIterator shall meet the additional requirements of a RandomAccessIterator type
unless PopulationIterator meets the additional requirements of a ForwardIterator
type.
So if I call sample(InputIterator, InputIterator, ForwardIterator, 1, rnd), I should get a compile error.
The specs suggest that we define a feature testing macro, but I could not find a precedent for that in libc++ include/experimental.
The new tests don't test the instability of the reservoir sampling algorithm. Should they?
The new tests don't test the instability of the reservoir sampling algorithm. Should they?
Yes - but I won't hold up accepting this because that test is missing.
The phrase in the standard "stable if and only if"...
test/std/experimental/algorithms/alg.random.sample/sample.stable.pass.cpp | ||
---|---|---|
45 | Couldn't this just be a call to std::is_sorted(first, last) ? |
include/experimental/algorithm | ||
---|---|---|
86 | I think that should forward the RNG: _VSTD::forward<_UniformRandomNumberGenerator>(__g) |
include/experimental/algorithm | ||
---|---|---|
41 | done |
More comments. The __sample that takes an input iterator and a random access iterator LGTM.
include/experimental/algorithm | ||
---|---|---|
59 | _Distance is the wrong type to be using here. It's possible that _Distance is a type that cannot represent the value of __k. The LFTS spec seems misleading about how distance should be used. I'll have to file a standard defect to clarify the wording. I think __k should be the distance_type of the _PopulationIterator. Example Code (This has UB and exits with -6 on my machine). std::vector<int> pop(1024, 42); std::vector<int> out(128, 0); std::minstd_rand g; std::experimental::sample(pop.begin(), pop.end(), out.begin(), (signed char)127, g); | |
75 | Do we need to re-construct the uniform_int_distribution every iteration? |
Since it is possible that we will not guard std::experimental::random_sample in a C++14 only block the tests should compile in C++03. Here is my patch to get them working. https://gist.github.com/EricWF/bef9f2a3ca4c86cb0781.
PS. you can pass --param=std=c++03 when invoking LIT to run the tests in C++03. Example:
lit -sv --param=std=c++03 test/std/experimental/algorithm
I'm fairly confident in the selection sampling algorithm as well. However I'm not quite ready to sign off on it. The algorithm as Knuth describes it generates its random numbers slightly differently.
Knuth lists the following steps for selection sampling:
- Generate a random number U uniformly distributed between zero and one.
- if (unsampled_size * U) < remaining_inserts then choose that element.
However this implementation of sample does:
- Generate a random number U uniformly distributed between zero and unsampled_size - 1.
- if U < remaining_inserts then choose that element.
My concern is that semantic change could introduce a bias. However I'm bad at math and stats so I don't know.
I want to think on it a little more to make sure they are equivalent.
I've also created a patch of various changes and additions related to adding the <experimental/algorithm> header and making the tests run in C++03 (this invalidates the above patch.). Please consider applying this to your patch.
https://gist.github.com/EricWF/c62b0979202cfd2cc6af
include/experimental/algorithm | ||
---|---|---|
74 | Same note about using _Distance |
Correct me if I am wrong, but as I read the standard it requires that if _PopulationIterator meets the requirements of ForwardIterator then we have to use selection sampling (because it is stable). Currently this requirement does not hold. If _PopulationIterator is a forward iterator and _SampleIterator is a random access iterator then reservoir sampling is used.
One way to fix this would be to dispatch to __sample(...) using either true_type or false_type instead of two iterator tags.
Here is an updated patch that fixes most of my concerns: https://gist.github.com/EricWF/c62b0979202cfd2cc6af
I've made the following changes:
- Fixes issues regarding the type of _Distance.
- Always select the selection sample algorithm when _PopulationIter is a ForwardIterator.
- Include <__undef_min_max> because min/max are used.
- Add _LIBCPP_DEBUG assert when Distance is negative.
- Get tests compiling in C++03.
- Adds more tests relating to adding the header <experimental/algorithm>.
- Update double include header test to include <experimental/algorithm>.
- Update the documentation regarding the status of LFTS Algorithms.
include/experimental/algorithm | ||
---|---|---|
75 | Not sure what I was thinking. Obviously we do. |
include/experimental/algorithm | ||
---|---|---|
75 | Well, uniform_int_distribution has an overload of operator() that can change the distribution parameters without re-creating the object, but it does exactly the same amount of work. |
This should probably have the system header pragma found in most other headers.