This is an archive of the discontinued LLVM Phabricator instance.

[NFC] Add llvm::StaticVector ADT
Needs Review · Public

Authored by jplayer-nv on Oct 20 2021, 1:05 PM.

Details

Reviewers
dblaikie
Summary

I have implemented llvm::StaticVector, a fixed-capacity vector class. Think llvm::SmallVector without the ability to grow beyond its inline storage capacity. llvm::StaticVector has lower overhead than an llvm::SmallVector with similar inline storage capacity:

Class storage savings:

  1. There is no data pointer; only the inline storage.
  2. The Size member's type can be scaled down, depending on the stated capacity, to save a few bytes (one way to pick the type is sketched below). For example: sizeof(StaticVector<char,200>) == 201, since a uint8_t is sufficient to track the maximum number of elements.
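
For illustration only, a hypothetical sketch of how such a size type could be selected (the patch's actual trait may differ):

#include <cstddef>
#include <cstdint>
#include <type_traits>

// Pick the smallest unsigned type able to represent Capacity.
template <std::size_t Capacity>
using SmallestSizeType =
    std::conditional_t<Capacity <= UINT8_MAX, std::uint8_t,
    std::conditional_t<Capacity <= UINT16_MAX, std::uint16_t,
    std::conditional_t<Capacity <= UINT32_MAX, std::uint32_t, std::uint64_t>>>;

// Matches the example above: one byte suffices to count up to 200 elements.
static_assert(sizeof(SmallestSizeType<200>) == 1, "uint8_t size member");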

Host generated code size savings:

  1. Heap-buffer-growth code is not generated, eliminating control flow and code size in common member functions.

If the user knows the required capacity at compile time, then a StaticVector is a more efficient choice than the jack-of-all-trades SmallVector. Compared to a std::array, there are nice properties to exploit as well. For example, no elements are constructed when an empty StaticVector object is instantiated (whereas a std::array constructs all of its elements when instantiated).
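
To make the intended usage concrete, here is a minimal sketch, assuming the interface mirrors SmallVector (the exact member set is whatever the patch defines):

#include "llvm/ADT/StaticVector.h" // header name assumed from this patch

void example() {
  // Unlike std::array, no elements are constructed up front; the vector
  // starts empty, and no heap allocation is ever performed.
  llvm::StaticVector<int, 8> V;
  for (int I = 0; I < 4; ++I)
    V.push_back(I); // exceeding the capacity of 8 would violate the precondition
  // Per the summary above, sizeof(StaticVector<char, 200>) == 201: the inline
  // storage plus a uint8_t size member.
}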

Diff Detail

Event Timeline

jplayer-nv created this revision. Oct 20 2021, 1:05 PM
jplayer-nv requested review of this revision. Oct 20 2021, 1:05 PM

Could fallible operations a la push_back return an llvm::Error?

jplayer-nv added a comment (edited). Oct 20 2021, 2:14 PM

Could fallible operations a la push_back return an llvm::Error?

I would think it better to return an llvm::Expected<reference> in that case, so we can still return a reference to the newly inserted element to the user. Either way, it would break API compatibility with llvm::SmallVector, requiring more refactoring than simply changing the container type.

Additionally, it seems like unwanted overhead in a class that is meant to be a leaner version of SmallVector for cases where we know the max capacity at host compile time.

If we follow through on this, I'd prefer to add to the API rather than change the existing member functions (i.e., add a member function such as push_back_fallible() so users can opt in to an llvm::Expected return).

Relevant statements from the proposed std::static_vector spec:

Exception Safety
Two main answers were explored in the prototype implementation:

  1. Throw an exception.
  2. Make this a precondition violation.

Throwing an exception is appealing because it makes the interface slightly more similar to that of std::vector. However, which exception should be thrown? It cannot be std::bad_alloc, because nothing is being allocated. It could throw either std::out_of_bounds or std::logic_error but in any case the interface does not end up being equal to that of std::vector.

The alternative is to make not exceeding the capacity a precondition on the static_vector's methods. This approach allows implementations to provide good run-time diagnostics if they so desire, e.g., on debug builds by means of an assertion, and makes implementations that avoid run-time checks conforming as well. Since the mutating methods have a precondition, they have narrow contracts, and are not conditionally noexcept.

This proposal chooses this path and makes exceeding the static_vector's capacity a precondition violation that results in undefined behavior. Throwing checked_xxx methods can be provided in a backwards compatible way.

I support the checked_push_back() addition mentioned above.
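
For reference, a minimal sketch of what such a checked insertion could look like, written here as a free function over the assumed StaticVector interface (the patch's actual checked_ members may differ):

#include "llvm/Support/Error.h"

// Hypothetical sketch: report capacity exhaustion via llvm::Expected instead
// of treating it as a precondition violation.
template <typename T, unsigned Capacity>
llvm::Expected<T &> checkedPushBack(llvm::StaticVector<T, Capacity> &V,
                                    const T &Elt) {
  if (V.size() >= Capacity)
    return llvm::createStringError(llvm::inconvertibleErrorCode(),
                                   "StaticVector capacity exceeded");
  V.push_back(Elt);
  return V.back(); // hand the caller a reference to the newly inserted element
}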

Reverse include order of ArrayRef.h and StaticVector.h in unit test code to address pre-merge check failure.

If the unit test is the only user, then you are bringing dead code. Do you have a real user?

If the unit test is the only user, then you are bringing dead code. Do you have a real user?

+1 to this, basically - might be worth deferring reviewing/committing this until some use case is confirmed? (I realize it creates somewhat of a chicken-or-egg problem - reviewing a dependent patch before the underlying one - but it might be beneficial to review the use case without having to worry about all these implementation details (knowing they're there, though, in terms of evaluating the total cost of the feature - so it's good to have this accessible for reference/high-level understanding).)

If the unit test is the only user, then you are bringing dead code. Do you have a real user?

+1 to this, basically - might be worth deferring reviewing/committing this until some use case is confirmed? (I realize it creates somewhat of a chicken-or-egg problem - reviewing a dependent patch before the underlying one - but it might be beneficial to review the use case without having to worry about all these implementation details (knowing they're there, though, in terms of evaluating the total cost of the feature - so it's good to have this accessible for reference/high-level understanding).)

Off the top of my head, I know there are a bunch of SmallVectors in X86ISelLowering with inline size equal to the maximum size that can occur. For example, there's one in LowerVectorCTLZInRegLUT with an inline size of 64, which is the maximum number of elements on AVX512.

I don't see too many simple substitutions in target-independent code, although this class has a SmallVector that never exceeds its inline storage (a drop-in type substitution works, including the YAML binding):

include\llvm\DebugInfo\CodeView\TypeRecord.h:

// LF_BUILDINFO
class BuildInfoRecord : public TypeRecord {
public:
  BuildInfoRecord() = default;
  explicit BuildInfoRecord(TypeRecordKind Kind) : TypeRecord(Kind) {}
  BuildInfoRecord(ArrayRef<TypeIndex> ArgIndices)
      : TypeRecord(TypeRecordKind::BuildInfo),
        ArgIndices(ArgIndices.begin(), ArgIndices.end()) {}

  ArrayRef<TypeIndex> getArgs() const { return ArgIndices; }

  /// Indices of known build info arguments.
  enum BuildInfoArg {
    CurrentDirectory, ///< Absolute CWD path
    BuildTool,        ///< Absolute compiler path
    SourceFile,       ///< Path to main source file, relative or absolute
    TypeServerPDB,    ///< Absolute path of type server PDB (/Fd)
    CommandLine,      ///< Full canonical command line (maybe -cc1)
    MaxArgs
  };

  StaticVector<TypeIndex, MaxArgs> ArgIndices;
};

Is the above substitution sufficient to move forward with a review? Or would you like to see more widespread usage?

I suspect this container is going to be more attractive in target-specific code (as is my interest). Is it appropriate for me to modify target-specific code to roll out a container like this?

Could fallible operations a la push_back return an llvm::Error?

I implemented a subset of checked_XXX versions of fallible member functions this morning. If this is an attractive feature, I can work on a complete set of them and add them to the patch.

I don't see too many simple substitutions in target-independent code, although this class has a SmallVector that never exceeds its inline storage (a drop-in type substitution works, including the YAML binding):

include\llvm\DebugInfo\CodeView\TypeRecord.h
...

Is the above substitution sufficient to move forward with a review? Or would you like to see more widespread usage?

Not sure - what's your take, @craig.topper / @tschuett?

It's still quite a bit of code; I'm not sure what the tipping point is at which the benefits would justify adding all of it (whether there's a "this much code size" or "this much performance" threshold, etc., that would justify this).

I suspect this container is going to be more attractive in target-specific code (as is my interest). Is it appropriate for me to modify target-specific code to roll out a container like this?

Could fallible operations a la push_back return an llvm::Error?

I implemented a subset of checked_XXX versions of fallible member functions this morning. If this is an attractive feature, I can work on a complete set of them and add them to the patch.

  • Define the MSVC visualizer for StaticVector.
  • Bind StaticVector as a simple YAML sequence.
  • Change StaticVector::Capacity to an enum to make it available from the visualizer.
  • Substitute some SmallVectors which never grow beyond their inline storage.
mehdi_amini added inline comments. Oct 21 2021, 6:01 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
27958 ↗ (On Diff #381414)

I'm really not convinced that this is a good use of this data structure: the invariant that makes 64 values enough is non-trivial and hard to validate.
The code above is:

int NumElts = VT.getVectorNumElements();
int NumBytes = NumElts * (VT.getScalarSizeInBits() / 8);

It isn't clear from there where the 64 comes from and what guarantee we have that it'll always be enough.
More importantly: I'm not convinced that this is the kind of situation where we would want, "by construction", to never exceed the SmallVector's inline storage, or whether it isn't best to continue to "just" have SmallVector as an optimization, without changing the overall "behavior contract" that is usual with vectors.

On the other hand, I find the YAML example more convincing.

I'd grep for std::array<llvm::Optional<...>, ...> — I suppose this would be a good replacement for at least some instances of that.

I'd grep for std::array<llvm::Optional<...>, ...> — I suppose this would be a good replacement for at least some instances of that.

I only found one instance of this construct, in DWARF code. As far as I could tell, the array was being accessed at ad hoc indices and therefore isn't ideal for a conversion to StaticVector (i.e., you don't get to drop the Optional on the value_type). If you know of any more specific examples of the above, let me know!

I'm not giving up here... but it doesn't seem like there are many more obvious, simple conversions. Unfortunately, the common construct I've converted in other code bases in the past is a C array with a separate variable tracking the number of valid elements in that array (sketched below), and that is difficult, if not impossible, to grep for.
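
To illustrate the pattern (hypothetical names, not taken from any particular file):

// Hand-rolled: a C array plus a separate count of valid elements.
void before() {
  unsigned Values[16];
  unsigned NumValues = 0;
  Values[NumValues++] = 42; // nothing enforces NumValues <= 16
  (void)Values;
  (void)NumValues;
}

// The same intent with the proposed container: the capacity is part of the
// type, and exceeding it is a checkable precondition violation.
void after() {
  llvm::StaticVector<unsigned, 16> Values;
  Values.push_back(42);
}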

My hope is that developers would run with this container if made available. So maybe this patch can just sit here until a suitable use is identified.

Here's another example of a possible StaticVector conversion:

llvm/lib/MC/WinCOFFObjectWriter.cpp

void WinCOFFObjectWriter::SetSectionName(COFFSection &S) {
...
  uint64_t StringTableEntry = Strings.getOffset(S.Name);
  if (StringTableEntry <= Max7DecimalOffset) {
    SmallVector<char, COFF::NameSize> Buffer;
    Twine('/').concat(Twine(StringTableEntry)).toVector(Buffer);
    assert(Buffer.size() <= COFF::NameSize && Buffer.size() >= 2);
    std::memcpy(S.Header.Name, Buffer.data(), Buffer.size());
    return;
  }
...
}

It's not as simple as a container type substitution since I haven't defined a raw_pwrite_stream for StaticVector. Assuming that's done, this shouldn't be a problem.

One nice benefit is that the container implicitly enforces:

assert(Buffer.size() <= COFF::NameSize);
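
For illustration, the substituted code might look roughly like the following, assuming Twine::toVector (or an equivalent adapter) is taught to write into a StaticVector:

  StaticVector<char, COFF::NameSize> Buffer;
  Twine('/').concat(Twine(StringTableEntry)).toVector(Buffer); // assumes StaticVector support
  assert(Buffer.size() >= 2); // the upper bound is now enforced by the container
  std::memcpy(S.Header.Name, Buffer.data(), Buffer.size());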

Will keep searching.

Here's another example of a possible StaticVector conversion:

llvm/lib/MC/WinCOFFObjectWriter.cpp
...

Yeah that seems like the perfect case! (I don't think it's common in the compiler world though, but happy to be proven wrong).

Resurrecting this review for the following:

  1. I defined checked_ versions of the member functions which may grow the container beyond its buffer capacity. All checked_ member functions return an llvm::Error or llvm::Expected.
  2. I found a more compelling potential use of StaticVector: as llvm::SmallSet's underlying small storage (see the sketch below).
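
To make that idea concrete, a rough sketch (not the actual llvm::SmallSet; interface and members simplified) of how the small form could be backed by a StaticVector:

#include <set>

// SmallSet keeps up to N elements in a small linear buffer and spills to a
// std::set beyond that. Since the small buffer never exceeds N by
// construction, a StaticVector fits naturally.
template <typename T, unsigned N>
class SmallSetSketch {
  llvm::StaticVector<T, N> Vector; // small form; conceptually replaces SmallVector<T, N>
  std::set<T> Set;                 // big form, used once the small form is full

public:
  bool insert(const T &V) {
    if (!Set.empty())
      return Set.insert(V).second;
    for (const T &Existing : Vector)
      if (Existing == V)
        return false; // already present in the small form
    if (Vector.size() < N) {
      Vector.push_back(V);
      return true;
    }
    // Spill the small form into the std::set, then insert there.
    Set.insert(Vector.begin(), Vector.end());
    Vector.clear();
    return Set.insert(V).second;
  }
};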
Herald added a project: Restricted Project. · View Herald Transcript · Apr 27 2022, 3:13 PM

Call llvm::consumeError() when an Error is returned in the StaticVector unit test.

Also minor formatting + parameter type name consistency changes.

Here's another example of a possible StaticVector conversion:

llvm/lib/MC/WinCOFFObjectWriter.cpp
...

Yeah that seems like the perfect case! (I don't think it's common in the compiler world though, but happy to be proven wrong).

Yeah, even here it seems like an assert after populating the vector isn't especially costly/problematic - even a bunch of those across the codebase seems hard for me to justify against the several thousand lines of similar vector code/testing/etc. That's my gut feeling at the moment, at least.

Just to give a concrete example of what I believe to be the benefit behind StaticVector, I ran this contrived test pitting SmallVector<uint64_t,N> against StaticVector<uint64_t,N> (with a few tweaks to ensure the push_back param is passed by value, as in SmallVector):

TYPED_TEST(StaticVsSmall, TimedStaticVector) {
  const size_t N = this->NumBuiltinElts(this->TheStaticVector);
  for (int I = 0; I < OutterLimit; ++I) {
    this->TheStaticVector.clear();
    for (int J = 0; J < N; ++J) {
      this->TheStaticVector.push_back(I + J);
    }
  }
}

TYPED_TEST(StaticVsSmall, TimedSmallVector) {
  const size_t N = this->NumBuiltinElts(this->TheSmallVector);
  for (int I = 0; I < OutterLimit; ++I) {
    this->TheSmallVector.clear();
    for (int J = 0; J < N; ++J) {
      this->TheSmallVector.push_back(I + J);
    }
  }
}

My local results are:

[ RUN      ] StaticVsSmall/0.TimedStaticVector
[       OK ] StaticVsSmall/0.TimedStaticVector (371 ms)
[ RUN      ] StaticVsSmall/0.TimedSmallVector
[       OK ] StaticVsSmall/0.TimedSmallVector (596 ms)

This demonstrates that simple insertions are significantly faster (almost 38% less time) with StaticVector than with a SmallVector that stays within its small buffer limit.

It would be interesting to speed up SmallSet insertions by retargeting the small storage to a StaticVector.

[ RUN      ] StaticVsSmall/0.TimedStaticVector
[       OK ] StaticVsSmall/0.TimedStaticVector (371 ms)
[ RUN      ] StaticVsSmall/0.TimedSmallVector
[       OK ] StaticVsSmall/0.TimedSmallVector (596 ms)

This demonstrates that simple insertions are significantly faster (almost 38% less time) with StaticVector than with a SmallVector that stays within its small buffer limit.

Let me amend this a bit... it looks like the reason SmallVector::push_back is slower in this case is insufficient specialization. Here's the trivially copyable version:

void push_back(ValueParamT Elt) {
  const T *EltPtr = reserveForParamAndGetAddress(Elt);
  memcpy(reinterpret_cast<void *>(this->end()), EltPtr, sizeof(T));
  this->set_size(this->size() + 1);
}

When Elt is passed by value, reserveForParamAndGetAddress() is unnecessary, since the parameter cannot alias the SmallVector's buffer; going through it inhibits host compiler optimization. If we SFINAE-split this function, we can match my StaticVector::push_back perf. The following SmallVector change works for me locally:

  template <typename RetT = std::enable_if_t<TakesParamByValue>>
  RetT push_back_impl(value_type Elt) {
    if (LLVM_UNLIKELY(this->size() >= this->capacity()))
      this->grow();
    memcpy(reinterpret_cast<void *>(this->end()), std::addressof(Elt),
           sizeof(T));
    this->set_size(this->size() + 1);
  }

  template <typename RetT = std::enable_if_t<!TakesParamByValue>>
  RetT push_back_impl(const_reference Elt) {
    const T *EltPtr = reserveForParamAndGetAddress(Elt);
    memcpy(reinterpret_cast<void *>(this->end()), EltPtr, sizeof(T));
    this->set_size(this->size() + 1);
  }

public:
  void push_back(ValueParamT Elt) { push_back_impl(Elt); }
Let me amend this a bit... it looks like the reason SmallVector::push_back is slower in this case is insufficient specialization. ... If we SFINAE-split this function, we can match my StaticVector::push_back perf. The following SmallVector change works for me locally:
...

Nice find.

When we added this assertion, we checked the compile-time suite and didn't get regressions. The theory being that the optimizer *should* have enough info to get to the same place on its own, since it has this code in reserveForParamAndGetAddress:

size_t NewSize = This->size() + N;
if (LLVM_LIKELY(NewSize <= This->capacity()))
  return &Elt;

bool ReferencesStorage = false;
int64_t Index = -1;
if (!U::TakesParamByValue) {
  if (LLVM_UNLIKELY(This->isReferenceToStorage(&Elt))) {
    ReferencesStorage = true;
    Index = &Elt - This->begin();
  }
}
This->grow(NewSize);
return ReferencesStorage ? This->begin() + Index : &Elt;

Ideally the outer if would be if constexpr(!U::TakesParamByValue) but that needs to wait for C++17.

The current implementation is easy to verify since the check is just in one place. Is it just a synthetic benchmark where it's causing a slowdown? Or, can we help the optimizer along somehow without changing the layering? (I don't feel strongly attached to exact design of the current thing, but IIRC it took some compromise at the time to make everyone happy. But I could be remembering the patch series incorrectly.)

If your host toolchain has C++17, I'd be curious if the regression goes away when you switch to that mode and use if constexpr.

I'd also be curious how this performs on your benchmark (after you fix the compiler errors / etc.):

template <class U, std::enable_if_t<U::TakesParamByValue, bool> = false>
static const T *reserveForParamAndGetAddressImpl(U *This, const T &Elt,
                                                 size_t N) {
  size_t NewSize = This->size() + N;
  if (LLVM_LIKELY(NewSize <= This->capacity()))
    return &Elt;

  This->grow(NewSize);
  return &Elt;
}
template <class U, std::enable_if_t<!U::TakesParamByValue, bool> = false>
static const T *reserveForParamAndGetAddressImpl(U *This, const T &Elt,
                                                 size_t N) {
  size_t NewSize = This->size() + N;
  if (LLVM_LIKELY(NewSize <= This->capacity()))
    return &Elt;

  bool ReferencesStorage = false;
  int64_t Index = -1;
  if (!U::TakesParamByValue) {
    if (LLVM_UNLIKELY(This->isReferenceToStorage(&Elt))) {
      ReferencesStorage = true;
      Index = &Elt - This->begin();
    }
  }
  This->grow(NewSize);
  return ReferencesStorage ? This->begin() + Index : &Elt;
}

Seems like it should be mostly equivalent to your solution, but the specialization is at a lower level where it's easy to verify the logic is equivalent... and it's also still obvious how if constexpr would apply once we have it.

tschuett added a comment (edited). Apr 29 2022, 2:20 AM

You could use __cplusplus to determine the C++ version.

Or check __cpp_if_constexpr == 201606.
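
For example (a sketch against the reserveForParamAndGetAddress excerpt quoted above), the aliasing check could be gated on the feature-test macro and compiled out for the by-value case when if constexpr is available:

#if defined(__cpp_if_constexpr) && __cpp_if_constexpr >= 201606L
  if constexpr (!U::TakesParamByValue) {
    if (LLVM_UNLIKELY(This->isReferenceToStorage(&Elt))) {
      ReferencesStorage = true;
      Index = &Elt - This->begin();
    }
  }
#else
  if (!U::TakesParamByValue) {
    if (LLVM_UNLIKELY(This->isReferenceToStorage(&Elt))) {
      ReferencesStorage = true;
      Index = &Elt - This->begin();
    }
  }
#endif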

Ideally the outer if would be if constexpr(!U::TakesParamByValue) but that needs to wait for C++17.

The current implementation is easy to verify since the check is just in one place. Is it just a synthetic benchmark where it's causing a slowdown? Or, can we help the optimizer along somehow without changing the layering? (I don't feel strongly attached to exact design of the current thing, but IIRC it took some compromise at the time to make everyone happy. But I could be remembering the patch series incorrectly.)

Just tried compiling for C++20 and added the constexpr. No improvement in runtime. FYI I'm building on nearly the latest MSVC 2022 (just one update behind).

My benchmark is just a contrived loop of push_back calls to the small storage.

My time is a bit fragmented today... I'll give your suggested reserveForParamAndGetAddressImpl() changes a whirl later and report back.

If your host toolchain has C++17, I'd be curious if the regression goes away when you switch to that mode and use if constexpr.

I'd also be curious how this performs on your benchmark (after you fix the compiler errors / etc.):

...

The above change does improve the results, but not quite as much as specializing at the push_back level.

For my toy benchmark, I'm seeing:

Change                                          Time (ms)    % Reduction
TOT                                             1190         -
specialized reserveForParamAndGetAddressImpl    750          37.0%
specialized push_back                           695          41.6%

I tried a couple of minor variations of my benchmark, and all generated similar results.

Thanks for running that. I just pushed https://github.com/dexonsmith/llvm-project/tree/perf/small-vector-specialize-reserveForParamAndGetAddressImpl, which should cause compile-time results to show up at http://llvm-compile-time-tracker.com/index.php?config=NewPM-O3&stat=instructions&remote=dexonsmith eventually, and we can see if there's a meaningful impact outside of the benchmark. Not sure whether it's worth changing push_back() since even on the toy benchmark it's not a big difference.

Thanks for running that. I just pushed https://github.com/dexonsmith/llvm-project/tree/perf/small-vector-specialize-reserveForParamAndGetAddressImpl, which should cause compile-time results to show up at http://llvm-compile-time-tracker.com/index.php?config=NewPM-O3&stat=instructions&remote=dexonsmith eventually, and we can see if there's a meaningful impact outside of the benchmark. Not sure whether it's worth changing push_back() since even on the toy benchmark it's not a big difference.

Also, it'd be interesting to look at what is different about the CodeGen, and hunt down why the optimizations are missed. Seems better to fix the optimizer than to add redundant specializations all over the place.

Thanks for running that. I just pushed https://github.com/dexonsmith/llvm-project/tree/perf/small-vector-specialize-reserveForParamAndGetAddressImpl, which should cause compile-time results to show up at http://llvm-compile-time-tracker.com/index.php?config=NewPM-O3&stat=instructions&remote=dexonsmith eventually, and we can see if there's a meaningful impact outside of the benchmark. Not sure whether it's worth changing push_back() since even on the toy benchmark it's not a big difference.

Results here:
http://llvm-compile-time-tracker.com/compare.php?from=c1e17c7dfedd27b95c8c2fba2b6473c7348f0e77&to=0be905f5dec8ebc4bd0fc2ec498db88a617884fc&stat=instructions

Almost within the noise. One thing is "light green". @dblaikie , thoughts on whether it's worth landing? (I can send a Phab review)

Thanks for running that. I just pushed https://github.com/dexonsmith/llvm-project/tree/perf/small-vector-specialize-reserveForParamAndGetAddressImpl, which should cause compile-time results to show up at http://llvm-compile-time-tracker.com/index.php?config=NewPM-O3&stat=instructions&remote=dexonsmith eventually, and we can see if there's a meaningful impact outside of the benchmark. Not sure whether it's worth changing push_back() since even on the toy benchmark it's not a big difference.

Also, it'd be interesting to look at what is different about the CodeGen, and hunt down why the optimizations are missed. Seems better to fix the optimizer than to add redundant specializations all over the place.

Yeah, it seems unfortunate that that produces any different code with or without the specialization, given it's one trivial/compile-time-constant condition & then a follow-on conditional operator.

Thanks for running that. I just pushed https://github.com/dexonsmith/llvm-project/tree/perf/small-vector-specialize-reserveForParamAndGetAddressImpl, which should cause compile-time results to show up at http://llvm-compile-time-tracker.com/index.php?config=NewPM-O3&stat=instructions&remote=dexonsmith eventually, and we can see if there's a meaningful impact outside of the benchmark. Not sure whether it's worth changing push_back() since even on the toy benchmark it's not a big difference.

Results here:
http://llvm-compile-time-tracker.com/compare.php?from=c1e17c7dfedd27b95c8c2fba2b6473c7348f0e77&to=0be905f5dec8ebc4bd0fc2ec498db88a617884fc&stat=instructions

Almost within the noise. One thing is "light green". @dblaikie , thoughts on whether it's worth landing? (I can send a Phab review)

My gut would be that it's probably not worth it - but if it's important to you/someone, I could live with it. Though, yeah, I would prefer to see the long-term fix, or at least a comment near the specialization pointing to a bug that, once resolved, could motivate removing the specialization again.