This is an archive of the discontinued LLVM Phabricator instance.

Add standard insert overloads to StringMap
ClosedPublic

Authored by dblaikie on Jun 15 2014, 6:17 PM.

Details

Summary

This patch adds to StringMap an overload of insert taking a std::pair<K, V>, following the standard associative container interface. This was suggested in http://reviews.llvm.org/D4130

Diff Detail

Event Timeline

K-ballo updated this revision to Diff 10436.Jun 15 2014, 6:17 PM
K-ballo retitled this revision from to Add standard insert overloads to StringMap.
K-ballo updated this object.
K-ballo edited the test plan for this revision. (Show Details)
K-ballo added reviewers: timurrrr, dblaikie, bkramer.
K-ballo added a subscriber: Unknown Object (MLST).
dblaikie edited edge metadata.Jun 16 2014, 12:01 AM

The test case looks a bit thin - it doesn't test the return value of
insert, only tests one of the two insert functions the patch
introduces, and doesn't test positive and negative cases of insert
(pre-existing element and new insertion)

Does this function have some/all redundancy with existing interface?
(should the other insert-like APIs be refactored to call insert (&
perhaps eventually removed in favor of just using insert?))

In the other code review you mention StringMap has an insert that
returns bool? Yet you're not modifying that version here? What
arguments does it take that make it distinct from the two functions
you're adding?

(though don't get me wrong, this seems like a good change to make)

In D4153#4, @dblaikie wrote:

The test case looks a bit thin - it doesn't test the return value of
insert, only tests one of the two insert functions the patch
introduces, and doesn't test positive and negative cases of insert
(pre-existing element and new insertion)

I will add those.

Does this function have some/all redundancy with existing interface?
(should the other insert-like APIs be refactored to call insert (&
perhaps eventually removed in favor of just using insert?))

The new overloads behave exactly like GetOrCreateValue, except that the parameters and return types are different. I'll reimplement it in terms of insert.

In the other code review you mention StringMap has an insert that
returns bool? Yet you're not modifying that version here? What
arguments does it take that make it distinct from the two functions
you're adding?

The existing insert is defined as bool insert(MapEntryTy *KeyValue), it takes a pointer to an already created MapEntryTy and acquires ownership of it. AFAIK, this is the only piece of functionality that would not be covered by a standard associative sequence interface.

In D4153#5, @K-ballo wrote:
In D4153#4, @dblaikie wrote:

The test case looks a bit thin - it doesn't test the return value of
insert, only tests one of the two insert functions the patch
introduces, and doesn't test positive and negative cases of insert
(pre-existing element and new insertion)

I will add those.

Does this function have some/all redundancy with existing interface?
(should the other insert-like APIs be refactored to call insert (&
perhaps eventually removed in favor of just using insert?))

The new overloads behave exactly like GetOrCreateValue, except that the parameters and return types are different. I'll reimplement it in terms of insert.

Great, thanks!

In the other code review you mention StringMap has an insert that
returns bool? Yet you're not modifying that version here? What
arguments does it take that make it distinct from the two functions
you're adding?

The existing insert is defined as bool insert(MapEntryTy *KeyValue), it takes a pointer to an already created MapEntryTy and acquires ownership of it. AFAIK, this is the only piece of functionality that would not be covered by a standard associative sequence interface.

Do any of the callers (at a glance) do anything fancy with this? Do they get a pre-allocated MapEntryTy object from somewhere far flung (from another map, from this map under another key, etc) - or do they just allocate it and pass it in immediately?

If it doesn't seem to add any value, it'd be nice to remove it (eventually - that task doesn't have to be yours to bear, but if it looks like that's the right path then a FIXME comment suggesting that direction could be useful).

include/llvm/ADT/StringMap.h
359

It looks like the only difference between these functions is the "std::move" here - if we can guarantee reasonably cheaply-movable value types, we could replace the two overloads with one function that takes the std::pair by value, and std::moves the value unconditionally.

If it seems like there are some types that will never be cheaply movable, then keeping the two overloads might be necessary - but the common suffix and prefix could be factored into another function (or the whole function could be refactored into one common utility with a lambda to choose how to process the value... ). Considering how to handle the deduplication across other operations could help point to the right factoring.

In the other code review you mention StringMap has an insert that
returns bool? Yet you're not modifying that version here? What
arguments does it take that make it distinct from the two functions
you're adding?

The existing insert is defined as bool insert(MapEntryTy *KeyValue), it takes a pointer to an already created MapEntryTy and acquires ownership of it. AFAIK, this is the only piece of functionality that would not be covered by a standard associative sequence interface.

Do any of the callers (at a glance) do anything fancy with this? Do they get a pre-allocated MapEntryTy object from somewhere far flung (from another map, from this map under another key, etc) - or do they just allocate it and pass it in immediately?

If it doesn't seem to add any value, it'd be nice to remove it (eventually - that task doesn't have to be yours to bear, but if it looks like that's the right path then a FIXME comment suggesting that direction could be useful).

I tried commenting out that overload, and I can immediately see ValueSymbolTable::reinsertValue (llvm/lib/IR/ValueSymbolTable.cpp:37) doing a reinsertion of an already allocated entry. It doesn't seem like removing it would be an option, given the special nature of StringMap<>::MapEntryTy.

include/llvm/ADT/StringMap.h
359

I noticed that GetOrCreateValue is already taking arguments by value and moving unconditionally, so I will be taking that approach for the next iteration of the patch.

K-ballo updated this revision to Diff 10460.Jun 16 2014, 2:48 PM
K-ballo edited edge metadata.

Changes in this iteration:

  • Simplified implementation.
  • Corrected doxygen comment.
  • Reimplemented GetOrCreateValue on top of the new insert.
  • Improved test case.

By having GetOrCreateValue call the new insert overload internally, I was able to detect a gross oversight on my first implementation. Inserting an element in the map can trigger a table rehashing, in which case the returned iterator was pointing to the wrong bucket. I sorted this in the simplest way that I could think that would give maximum performance, and that is by modifying RehashTable to map an old bucket number into the corresponding new bucket number. I would like to add a test case that triggers a table rehash but I'm not sure how to do that, suggestions?

In D4153#8, @K-ballo wrote:

Changes in this iteration:

  • Simplified implementation.
  • Corrected doxygen comment.
  • Reimplemented GetOrCreateValue on top of the new insert.
  • Improved test case.

By having GetOrCreateValue call the new insert overload internally, I was able to detect a gross oversight on my first implementation. Inserting an element in the map can trigger a table rehashing, in which case the returned iterator was pointing to the wrong bucket. I sorted this in the simplest way that I could think that would give maximum performance, and that is by modifying RehashTable to map an old bucket number into the corresponding new bucket number.

Seems like a reasonable fix.

I would like to add a test case that triggers a table rehash but I'm not sure how to do that, suggestions?

Not sure, actually. It looks like the growth function is the usual sort of thing, so probably just inserting a second element should cause a rehash - no?

include/llvm/ADT/StringMap.h
338

Do you need the "NewItem" variable? Could you just assign straight to Bucket?

(you could do the Bucket == getTombstoneVal() before the MapEntryTy::Create? Or does Create need to worry about the old tombstone count?)

lib/Support/StringMap.cpp
214 ↗(On Diff #10460)

Please remove formatting changes to lines unrelated to your change.

If you're using clang-format to reformat, you can use the git/diff clang format scripts to just reformat the lines your code changes.

unittests/ADT/StringMapTest.cpp
214

That's an annoyingly creepy difference (first being a function, second being a value) - probably be good if first was a StringRef member to begin with. Again, not your problem though.

220

While it's true this (testValue + 1) is an xvalue, it's not really an interesting xvalue - a test case with a move-only value type might be more interesting and fully exercise the std::move stuff in insert.

I would like to add a test case that triggers a table rehash but I'm not sure how to do that, suggestions?

Not sure, actually. It looks like the growth function is the usual sort of thing, so probably just inserting a second element should cause a rehash - no?

I did not see this in my previous tests using up to 10 values, because the hash table starts with 16 buckets. I see that I can specify how many buckets to use on construction, so creating a map with only 1 bucket and adding two elements to it would trigger a rehash. Then I'd have to make sure that the rehash actually moves the bucket around, which is the tricky part.

include/llvm/ADT/StringMap.h
338

I followed the implementation of GetOrCreateValue here, but I don't see any obvious reason why I couldn't just do the assignment directly.

lib/Support/StringMap.cpp
214 ↗(On Diff #10460)

Thanks for that tip! I was wondering how people were able to use clang-format in situations like these.

unittests/ADT/StringMapTest.cpp
214

I had initially added the insert overload doing batch insertion, which is an interesting one to have as it could avoid extra table rehashing. I did not include it in this patch because of the weirdness of StringMapIterator, which isn't even an _Iterator_.

This would be a big breaking change, but if that is something acceptable I could look into making StringMap more like an std::map, or at least more like DenseMap. This would be of course not part of this patch.

220

My intention here is actually to test that the value is not updated when the pair already exists in the map. Since insert is taking the argument by value, I did not considered adding a special test case for moves. In order for such test to be meaningful, it would have to keep track of the number of copies, and StringMap doesn't fully work with movable-only types (StringMapEntry::set_value makes copies).

K-ballo updated this revision to Diff 10473.Jun 16 2014, 6:08 PM

Changes in this iteration:

  • Removed extra variable in insert.
  • Removed unwanted changes from clang-format.
  • Added a test case (attempt) for insertion where rehashing happens, and the bucket pointing to the element changes. This is a bad test, but I'm including it as a base for discussion on how should such a test look, if it's even sensible to have it.
dblaikie added inline comments.Jun 19 2014, 7:17 AM
unittests/ADT/StringMapTest.cpp
231

What is this EXPECT test for?

K-ballo added inline comments.Jun 19 2014, 11:08 AM
unittests/ADT/StringMapTest.cpp
231

As I said before, this test is not correct. In order for the test case to be useful, it has to have knowledge of the hashing function and the internals of StringMap. If either of those change, this test might became meaningless. That EXPECT was my attempt to have this test fail if the hashing function changes, but is of course wrong. I really don't know how to reliably reproduce a situation that depends on several implementation details.

dblaikie added inline comments.Jun 19 2014, 11:12 AM
unittests/ADT/StringMapTest.cpp
231

What change to the hash value fo the string could cause the test to become invalid, and in what way would the test be invalid?

Is the concern that the hashed value of the string might fit in the first bucket that's already there? (why doesn't it anyway? is that one bucket a sentinel? If it's a sentinel then it doesn't matter what the hashed value of the string is - it'll always cause a rehashing/growth)

In either case, the EXPECT num buckets == 1, insert, then EXPECT num buckets == 2 seems to suffice to verify that a rehash/growth has occurred, does it not?

I'm actually OK with the test just doing that (dropping the hash value checking, keeping the "number of buckets changed" check) and putting a big comment explaining how brittle this test is.

K-ballo added inline comments.Jun 19 2014, 11:43 AM
unittests/ADT/StringMapTest.cpp
231

Here is what happens: There is only one bucket in the map, so the insert naturally places the (pointer to the) new element there. Since the map is full, more buckets are added and elements are rehashed. For this particular input, the rehash causes the (pointer to the) new element to be moved to the second bucket. That's the condition that I would like to test for.

I am fine with your suggestion, as I don't have a better way to address this. I'll work on those changes.

dblaikie accepted this revision.Jun 19 2014, 11:47 AM
dblaikie edited edge metadata.

Sounds like we're on the right track - last changes are minor (just to remove the hash value check). Please commit once that's done.

unittests/ADT/StringMapTest.cpp
231

OK, so the bit I was missing was that you need a specific hash function to cause the element to move buckets... yeah, that's tricky.

I don't mind not checking the hash function still. Yes, if someone ripped that code out and didn't update the bucket during insertion the test might still pass. Pity, not sure there's much to be done about it.

This revision is now accepted and ready to land.Jun 19 2014, 11:47 AM
K-ballo updated this revision to Diff 10652.Jun 19 2014, 12:37 PM
K-ballo updated this object.
K-ballo edited edge metadata.

Changes in this iteration:

  • Removed hash check.
  • Added comment explaining how brittle the rehash test case is.

David: I don't have commit access, could you take care of that?

dblaikie commandeered this revision.Jun 19 2014, 12:46 PM
dblaikie edited reviewers, added: K-ballo; removed: dblaikie.
This revision now requires review to proceed.Jun 19 2014, 12:46 PM
dblaikie accepted this revision.Jun 19 2014, 1:17 PM
dblaikie added a reviewer: dblaikie.

Committed in r211309.

Had to fix some sign compare warnings by using 'u' after a few constants in the test cases. Usual annoyance with the EXPECT macros doing type deduction, etc...

Fixed the grammar mistake Alp mentioned in review.

This revision is now accepted and ready to land.Jun 19 2014, 1:17 PM
dblaikie closed this revision.Jun 19 2014, 1:18 PM