This is an archive of the discontinued LLVM Phabricator instance.

[libc++] Fix the return value of max_size()
ClosedPublic

Authored by ldionne on Nov 22 2021, 1:47 PM.

Details

Reviewers
None
Group Reviewers
Restricted Project
Commits
rG2db67e97712e: [libc++] Fix the return value of max_size()
Summary

I assume nobody ever uses std::string_view::max_size() outside of
testing. However, we should still return a value that is based on
something with a reasonable rationale. Previously, we would forget
to take into account the size of the character type stored in the
string. This patch factors both that and the size of the string_view
itself into account.

Thanks to @mclow.lists for pointing out this issue.

Diff Detail

Event Timeline

ldionne requested review of this revision.Nov 22 2021, 1:47 PM
ldionne created this revision.
Herald added a project: Restricted Project. · View Herald TranscriptNov 22 2021, 1:47 PM
Herald added a reviewer: Restricted Project. · View Herald Transcript
Quuxplusone added inline comments.
libcxx/include/string_view
359

(1) This seems unnecessarily fiddly; I don't really see anything wrong with the original.
(2) If we're going to do this, let's add a line break somewhere.
(3) Since sv.size() cannot physically return anything bigger than PTRDIFF_MAX, you probably want to use ptrdiff_t instead of size_type here... or maybe return std::min(PTRDIFF_MAX, SIZE_MAX / sizeof(value_type))? But again, this is needlessly fiddly and I see nothing wrong with just returning SIZE_MAX unconditionally. Literally nobody should ever be calling this method.

Do any of the other methods (e.g. the constructor) need to do something special to preserve the postcondition that sv.size() <= sv.max_size()? Before this change, the postcondition was vacuously true; now, it could conceivably be false, and maybe we need to check for that?

mclow.lists added inline comments.Nov 22 2021, 4:31 PM
libcxx/include/string_view
359

My preference is return (numeric_limits<size_type>::max() / sizeof(value_type) since this is the max number of elements that a string_view can hold.

As for "literally nobody should be calling this method", I agree, with the exception of people writing test suites.

libcxx/test/std/strings/string.view/string.view.capacity/capacity.pass.cpp
51

this is just not true, IMHO.

The sequence of value_types that a string_views pointer points at can (theoretically) be up to size_t bytes long.

ldionne marked 2 inline comments as done.Nov 24 2021, 2:40 PM
ldionne added inline comments.
libcxx/include/string_view
359

@Quuxplusone

(1) This seems unnecessarily fiddly; I don't really see anything wrong with the original.

If the character type is more than 1 byte long, the original is wrong.

@mclow.lists

My preference is return (numeric_limits<size_type>::max() / sizeof(value_type) since this is the max number of elements that a string_view can hold.

Yeah, I agree now. I was using - sizeof(*this) as some sort of poor man's way of removing the current string_view's memory usage from the lot. However, this is entirely misguided -- re-reading the definition of size_t:

std::size_t can store the maximum size of a theoretically possible object of any type (including array).

It doesn't have anything to do with the amount of memory usable on a system, it's only the maximum size of a single object. So yeah, I agree with your suggested definition.

libcxx/test/std/strings/string.view/string.view.capacity/capacity.pass.cpp
51

Yeah, agreed too -- this was basically the same misconception I explained above. Will fix this, thanks for indirectly teaching me something :-).

ldionne updated this revision to Diff 389624.Nov 24 2021, 2:55 PM
ldionne marked 2 inline comments as done.

Apply reviewer feedback. Thanks!

FWIW, I'm still of the (very firmly held) opinion that this patch is unnecessary. max_size() is always just an upper bound on the size of the sequence. Lowering that upper bound

  • might introduce actual factual errors, e.g. if we get the bound wrong — whereas size_t(-1) can never be wrong
  • will never help any user of libc++, because a loose upper bound (as returned by max_size) is never useful and always useless
  • is no closer to being "accurate," because it is just as impossible in practice to have a string_view of length 9223372036854775807 or 4611686018427387903, as to have a string_view of length 18446744073709551615. These are all equally lying numbers, and there's no point pretending that any one of them is "more useful" or "more accurate" than any other.

FWIW, I'm still of the (very firmly held) opinion that this patch is unnecessary. max_size() is always just an upper bound on the size of the sequence. Lowering that upper bound

  • might introduce actual factual errors, e.g. if we get the bound wrong — whereas size_t(-1) can never be wrong
  • will never help any user of libc++, because a loose upper bound (as returned by max_size) is never useful and always useless
  • is no closer to being "accurate," because it is just as impossible in practice to have a string_view of length 9223372036854775807 or 4611686018427387903, as to have a string_view of length 18446744073709551615. These are all equally lying numbers, and there's no point pretending that any one of them is "more useful" or "more accurate" than any other.

Consider a 32 bit system. numeric_limits<size_t>::max() is 4294967295.
Now consider: string_view(p, 3'000'000'000); i.e, a string view that encompasses 3GB of RAM.

That works today. I will concede that if you try to calculate distance(begin(), end()) you get an overflow, but all the other calls work fine. You can enumerate the elements, you can resize, etc.

FWIW, I'm still of the (very firmly held) opinion that this patch is unnecessary. max_size() is always just an upper bound on the size of the sequence. Lowering that upper bound

  • might introduce actual factual errors, e.g. if we get the bound wrong — whereas size_t(-1) can never be wrong

Which kinds of errors would you be worried about? The sizeof division itself isn't so complicated that I'd be worried about it being wrong.

  • is no closer to being "accurate," because it is just as impossible in practice to have a string_view of length 9223372036854775807 or 4611686018427387903, as to have a string_view of length 18446744073709551615. These are all equally lying numbers, and there's no point pretending that any one of them is "more useful" or "more accurate" than any other.

It's a bit unfortunate though that a basic_string_view<char16_t> of size 9223372036854775808 has a memory size of 0 bytes on a 64-bit system. (Or does it? The problem is that unsigned overflow is well-defined.)

Sure, the other numbers are also highly suspicious, but there are different kinds of wrong.

libcxx/include/string_view
359

Isn't the point of max_size to return the maximal size so that address or memory consumption computation doesn't wrap around or overflow? So I think this change is correct. Surely max_size is more relevant in (owning) containers, but staying below should guarantee that size() * sizeof(value_type) doesn't overflow.

Do any of the other methods (e.g. the constructor) need to do something special to preserve the postcondition that sv.size() <= sv.max_size()? Before this change, the postcondition was vacuously true; now, it could conceivably be false, and maybe we need to check for that?

One might add an assertion to the basic_string_view(const _CharT*, size_type) constructor, but it'll probably not find anything relevant.

@aaronpuchert wrote:

Which kinds of errors would you be worried about? The sizeof division itself isn't so complicated that I'd be worried about it being wrong.

Well, the original PR had had a division-and-subtraction, and then was updated to just a division when the original math turned out to have been factually wrong. https://reviews.llvm.org/D114395?vs=389034&id=389624#toc This follows the "Less code = less bugs" principle: division+subtraction turned out to have been wronger than division, and I'm just saying that division in turn is potentially wronger than no-division (and cannot possibly be righter than no-division, since no-division is a factually correct upper bound by definition).

@mclow.lists writes:

Consider a 32 bit system... string_view(p, 3'000'000'000) ... That works today.

Yes, and it will keep working after any of these proposed changes, right? We just need to maintain string_view::max_size() > 3000000000.
The interesting case for this PR is u16string_view(p, 1'500'000'000)... which is still satisfied by both the existing size_t(-1) and the proposed size_t(-1) / sizeof(char16_t). So the question is just whether any libc++ user will be looking at u16string_view::max_size() and making decisions based on its value, and if so, what value will lead to them making better decisions — size_t(-1), or size_t(-1)-divided-by-something.

libcxx/include/string_view
359

Do any of the other methods (e.g. the constructor) need to do something special to preserve the postcondition that sv.size() <= sv.max_size()?

Actually, I'm now reasonably confident that we should not assert or throw in this situation. It would even be worthwhile to add a unit test for this:

std::u16string_view sv(nullptr, size_t(-1));  // all preconditions are met; this must work
assert(sv.size() == size_t(-1));

It's true that accessing sv[42] would be UB at this point, but I think the Standard is 100% clear about the values that sv's data members must have, after that constructor call.

However, I admit that this is irrelevant to what value we return for max_size(). It's OK for sv.size() > sv.max_size(). (string_view is different from string in this respect.)

mclow.lists added a comment.EditedNov 29 2021, 2:07 PM

The definition of max_size() is the maximum number of elements that a string_view can hold.
(Actual definition: Returns: The largest possible number of char-like objects that can be referred to by a basic_string_view.)

Also, numeric_limits<size_t>::max() is the maximum size (in bytes) of any data structure.

My argument is that those two imply that sv.max_size() * sizeof(sv::value_type) (pardon the fak-ey code) should:

  • Not overflow.
  • Be no larger than numeric_limits<size_t>::max()

Interestingly enough:

  • libstdc++ returns: (numeric_limits<size_t>::max() - epsilon)/sizeof(Elem)
  • MSVC returns: min(numeric_limits<ptrdiff_t>::max(), numeric_limits<size_t>::max()/sizeof(Elem))
ldionne accepted this revision as: Restricted Project.Dec 6 2021, 10:52 AM

The definition of max_size() is the maximum number of elements that a string_view can hold.
(Actual definition: Returns: The largest possible number of char-like objects that can be referred to by a basic_string_view.)

Also, numeric_limits<size_t>::max() is the maximum size (in bytes) of any data structure.

My argument is that those two imply that sv.max_size() * sizeof(sv::value_type) (pardon the fak-ey code) should:

  • Not overflow.
  • Be no larger than numeric_limits<size_t>::max()

This is what sways me. I think we all agree that max_size() is kind of a ridiculous method. However, the current definition is more wrong than the one proposed in this patch based on the above criteria, which I think is good. Well, I don't really care that much, but I think it's no wronger than what we are doing anyways.

Accepting as libc++ based on Marshall's review.

This revision was not accepted when it landed; it landed in state Needs Review.Dec 6 2021, 10:53 AM
This revision was automatically updated to reflect the committed changes.