This is an archive of the discontinued LLVM Phabricator instance.

Add OrderedSet, with constant-time insertion and removal, and random access iteration.
AbandonedPublic

Authored by labrinea on Jul 6 2018, 8:44 AM.

Details

Summary

Implements a hash table with constant-time insertion and removal, and iteration in some consistent order across runs. Two copies of each element are kept, one is stored in a std::vector for random access iteration, and one is used as a key in a DenseMap, which contains indexes to the vector.

Diff Detail

Event Timeline

labrinea created this revision.Jul 6 2018, 8:44 AM
mgrang added a subscriber: mgrang.Jul 6 2018, 11:02 AM

Some nits.

include/llvm/ADT/OrderedSet.h
35

happend -> happened

40

Space needed around arithmetic operator.

42

Space needed around arithmetic operator.

71

Space needed around arithmetic operator.

The extra copy of each value is a bit unfortunate, but it would be a lot of work to fix (you'd basically have to reimplement DenseMap), and it's not a big deal for pointer-sized keys, so should be fine to leave that as-is for now.

I guess this is fine, unless someone else has a better suggestion.

chandlerc requested changes to this revision.Jul 6 2018, 7:12 PM

We already have SetVector which is widely used for these patterns. If we need both, we need a clear explanation of the difference and why we need both (IE, why users of one can't use the other).

This revision now requires changes to proceed.Jul 6 2018, 7:12 PM

We already have SetVector which is widely used for these patterns. If we need both, we need a clear explanation of the difference and why we need both (IE, why users of one can't use the other).

The remove operation on a SetVector can cost linear time and this might be an issue in some cases. Please have a look at the review comments of https://reviews.llvm.org/D48372

We already have SetVector which is widely used for these patterns. If we need both, we need a clear explanation of the difference and why we need both (IE, why users of one can't use the other).

The remove operation on a SetVector can cost linear time and this might be an issue in some cases. Please have a look at the review comments of https://reviews.llvm.org/D48372

Assuming we end up with two ADTs, we'll need: better names, clear header documentation of what's different, and an explanation in https://llvm.org/docs/ProgrammersManual.html.

Personally, I'd be in favour of changing MapVector to just do this, since it already has an index. I'd also be in favour of adding an index to SetVector to match. There'd be some work to do in auditing/updating users, but I doubt there are many calls to erase() and I'm skeptical that anyone relies on the current order for something other than determinism.

Personally, I'd be in favour of changing MapVector to just do this, since it already has an index. I'd also be in favour of adding an index to SetVector to match. There'd be some work to do in auditing/updating users, but I doubt there are many calls to erase() and I'm skeptical that anyone relies on the current order for something other than determinism.

I think we should not change MapVector. Its underlying data structure is a vector with <key,value> pairs but we only need to store a single value. Moreover, both MapVector and SetVector provide insertion order iteration and their users might rely on that. I am not sure we should change that. SmallSet might be a good alternative for lower insertion/deletion complexity, but at the moment it does not provide iteration over the underlying data structures. It uses a SmallVector for small data sets and expands it to an std::set if it grows too much. Is there a way to abstract those underlying iterators? If not I think we should keep this new ADT and rename/document it properly.

rnk added a comment.Jul 23 2018, 11:57 AM

Personally, I'd be in favour of changing MapVector to just do this, since it already has an index. I'd also be in favour of adding an index to SetVector to match. There'd be some work to do in auditing/updating users, but I doubt there are many calls to erase() and I'm skeptical that anyone relies on the current order for something other than determinism.

I think we should not change MapVector. Its underlying data structure is a vector with <key,value> pairs but we only need to store a single value. Moreover, both MapVector and SetVector provide insertion order iteration and their users might rely on that. I am not sure we should change that. SmallSet might be a good alternative for lower insertion/deletion complexity, but at the moment it does not provide iteration over the underlying data structures. It uses a SmallVector for small data sets and expands it to an std::set if it grows too much. Is there a way to abstract those underlying iterators? If not I think we should keep this new ADT and rename/document it properly.

I like @dexonsmith's suggestion. @labrinea, you don't have to change MapVector, but I think it's worth changing SetVector to maintain the vector indices in its map and to make erase O(1). Refactoring them to use the same implementation and implementing this optimization for MapVector can be done later.

labrinea abandoned this revision.May 24 2023, 4:55 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 24 2023, 4:55 AM