This is an archive of the discontinued LLVM Phabricator instance.

Try again to implement a FIFO task queue
ClosedPublic

Authored by zturner on Jun 15 2018, 2:52 PM.

Details

Summary

In D48115 I attempted to modify ThreadPool so that it could support tasks that returned values. I got this working, but in doing so I added a test that relied on sequential execution of tasks that run on the same thread, and this uncovered some problems in my approach.

What I *really* wanted was to create a ThreadPool with a single thread, queue up asynchronous work to it, and then assume that the work would be completed sequentially. After having some difficulty making these assumptions consistent with those of the ThreadPool class, we decided that ThreadPool ultimately isn't the right place for that, and that we should have a separate class explicitly for this purpose. So that patch was reverted.

Enter TaskQueue. A TaskQueue consists of exactly one thread that always executes its work first-in, first-out. When you queue asynchronous work to it, it returns a future that can be waited on. When the wait returns, you are guaranteed that the work you waited on, as well as all previous work, has been completed.

Getting the LLVM_ENABLE_THREADS=0 case working is quite a tricky job. There may be a better solution than what I've come up with here, but this is the only way I could get it to work. When LLVM_ENABLE_THREADS=1, the problem is easy: just add the work to a queue, and process the queue on a single blessed background thread. When LLVM_ENABLE_THREADS=0, however, you have to consider the case where a user queues up multiple asynchronous tasks and then waits on them out of order. So if you queue F1, F2, and F3, and wait on F3, F3 has to manually invoke F2's work, which has to manually invoke F1's work. The implementations of LLVM_ENABLE_THREADS and !LLVM_ENABLE_THREADS therefore diverge somewhat, but I couldn't find any other way to make this work.
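To make the out-of-order wait concrete, here is a minimal, self-contained sketch of the deferred-future idea (hypothetical names and code, not the actual patch): each future, when waited on, first drains every earlier task that hasn't run yet.

#include <deque>
#include <functional>
#include <future>
#include <iostream>

// Hypothetical single-threaded queue: async() returns a deferred future that,
// when waited on, runs every not-yet-run task up to and including its own.
class SingleThreadedQueueSketch {
  std::deque<std::function<void()>> Tasks;
  size_t NextToRun = 0;

public:
  std::future<void> async(std::function<void()> Work) {
    Tasks.push_back(std::move(Work));
    size_t MyIndex = Tasks.size() - 1;
    // std::launch::deferred: nothing runs until someone waits on the future.
    return std::async(std::launch::deferred, [this, MyIndex] {
      while (NextToRun <= MyIndex)
        Tasks[NextToRun++](); // FIFO: earlier tasks always run first
    });
  }
};

int main() {
  SingleThreadedQueueSketch Q;
  auto F1 = Q.async([] { std::cout << "task 1\n"; });
  auto F2 = Q.async([] { std::cout << "task 2\n"; });
  auto F3 = Q.async([] { std::cout << "task 3\n"; });
  F3.wait(); // runs tasks 1, 2, and 3, in order, on this thread
  F1.wait(); // already run; returns immediately
  F2.wait(); // likewise
}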

Note that my final goal (supporting tasks that return a value) is still unimplemented in this patch. I plan to do that in a followup, as there are some additional complexities that need to be taken into consideration, mostly in the LLVM_ENABLE_THREADS=0 branch.

Diff Detail

Event Timeline

zturner created this revision. Jun 15 2018, 2:52 PM
zturner updated this revision to Diff 151590. Jun 15 2018, 5:09 PM

Added two more tests, the first of which exposed a bug in the wait() function. All the wait() function was doing was joining the work thread, but this is incorrect. What it's supposed to do is simply drain the queue, i.e. keep processing work until the queue is empty; it's not supposed to join the thread. Joining the thread didn't work anyway, since we weren't setting the EnableFlag to false, so this method was deadlocking.
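For illustration, here is a minimal sketch of the corrected semantics (hypothetical names, not the patch's code): wait() blocks until the queue is drained and nothing is executing, and never joins the worker thread.

#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>

struct DrainableQueueSketch {
  std::deque<std::function<void()>> Tasks;
  bool ActiveTask = false;
  std::mutex QueueLock;
  std::condition_variable CompletionCondition;

  // Block until no tasks are queued and none is executing; the worker thread
  // (not shown) stays alive afterward so more work can be queued.
  void wait() {
    std::unique_lock<std::mutex> Lock(QueueLock);
    CompletionCondition.wait(
        Lock, [this] { return Tasks.empty() && !ActiveTask; });
  }
};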

To fix this I had to borrow some more code from ThreadPool, but all tests now pass both with LLVM_ENABLE_THREADS=0 and LLVM_ENABLE_THREADS=1.

zturner updated this revision to Diff 151592. Jun 15 2018, 5:15 PM

Made the test cases a little stronger. Before, I was simply assigning to the local variables in the test case; now I increment them. This way the test will fail if a task ever gets executed more than once, whereas before it would pass, since the task would just re-assign the same value.

Before I dig too deeply into the code, I want to better understand the underlying goal...

The core primitive here is a *serialized* task queue from what I understand...

Is it important that it starts a separate thread and eagerly (as opposed to lazily) processes the enqueued tasks?

If not, what about making this a lazy system and a more literal work queue?

If so, would it be a problem for the thread(s) used to be shared with other work at times (provided the synchronized aspect is preserved)?

Before I dig too deeply into the code, I want to better understand the underlying goal...

The core primitive here is a *serialized* task queue from what I understand...

Is it important that it starts a separate thread and eagerly (as opposed to lazily) processes the enqueued tasks?

Yes

If so, would it be a problem for the thread(s) used to be shared with other work at times (provided the synchronized aspect is preserved)?

Depends on what you mean by shared. Certainly you can queue up any work you want here, so if your jobs are A, C, and E, and someone else's jobs are B, D, and F, then queueing up A, B, C, D, E, F should work just fine.

The motivation for this: imagine you're writing an application that segments its work onto different threads. For example, you might have a UI thread, a file I/O thread, and a network thread. Work that changes the state of a UI element has to happen on the UI thread, work that physically waits on a socket has to happen on the network thread, and so on. This is a pretty common architecture for certain types of applications.

Another example might be a debugger which is debugging N processes simultaneously, where, due to various OS restrictions, each of these N processes is being debugged by a separate thread. The application controlling all this would like to not have to deal with synchronization, but if each of these threads just said "wait for something to happen, then invoke a callback", the callback would happen on an arbitrary thread context and the user would be responsible for synchronizing all of it. It's nice to be able to guarantee something like "the callback always gets invoked on the same thread" to simplify the user's life.

LMK if this makes sense.

Another example might be a debugger which is debugging N processes simultaneously, where, due to various OS restrictions, each of these N processes is being debugged by a separate thread.

We've had something similar to this in lldb (it was called ThreadStateCoordinator), but it was removed when we made the whole lldb-server process single-threaded.

llvm/include/llvm/Support/TaskQueue.h
42–44

Are all of these really needed? The only use of ParentFuture I see is in the lambda in ThisFuture, and that one already has it captured. ThisTask is not captured by that lambda, but it could easily be, at which point the whole Task struct could just become std::shared_future<void>.

llvm/lib/Support/TaskQueue.cpp
80

The assert message could be updated to reflect the new home.

103–108

This could be implemented as Tasks.back().wait(), avoiding the direct access into the ThisTask function and making sure the tasks are always invoked the same way.

llvm/unittests/Support/TaskQueueTest.cpp
65–71

I'm not sure if it's necessary, but if you wanted to have the test be uniform for both kinds of builds, you could replace MainThreadReady with an integer variable and then have the tasks wait until that variable is >= some_value.

zturner added inline comments. Jun 18 2018, 4:38 PM
llvm/include/llvm/Support/TaskQueue.h
42–44

I was able to remove them as you suggested in this patch, but things get trickier in the follow-up patch where I extend this to allow the queued-up tasks to return values. Regardless, since these are independent patches, I'm uploading a new version that has the suggested changes, and I can try to make the other one use a similar model separately.

zturner updated this revision to Diff 151817. Jun 18 2018, 4:39 PM

Some suggestions about the way you package things and manage std::future stuff below.

Some higher level comments here...

The term I think you're looking for is "synchronized" or "serial". So I would call this a SynchronizedTaskQueue or SerializedTaskQueue. Given the other use of the term "serialized", I'd lean towards Synchronized or Sync if that's too long to type. Ultimately, the name needs to make it clear that it enforces that the tasks scheduled here execute without overlap.

I also wonder whether you want to decouple this serialized execution from the thread creation and management. Could you accept a thread pool instead, and enqueue work on the thread pool? You can do this by having each task on the thread pool, on completion, check the queue for more work, and if there is some, schedule a new task on the thread pool.

The reason why this is interesting is that if you have, say, a system thread pool that gets a lot of other work, you can intersperse these tasks on it without overloading the system. It also allows the client of this class to dictate which threads are allowable threads for these tasks to run on, by providing a suitable thread pool. My hope is that, worst case, creating a one-thread thread pool and handing it to this class would be equivalent to what you have here, but with a cleaner separation of roles. I understand it would add some complexity; for example, you'd have to handle the case where no work is currently in flight when asyncImpl finishes adding the task to the queue, and start a new batch. But this seems reasonably tolerable.
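A rough sketch of that chaining scheme (illustrative names, and it assumes the pool's submit hook enqueues asynchronously rather than running work inline): the queue owns no thread, hands at most one task at a time to the pool, and each finished task schedules the next.

#include <deque>
#include <functional>
#include <mutex>

class SerialQueueSketch {
  std::function<void(std::function<void()>)> Submit; // e.g. wraps a pool
  std::deque<std::function<void()>> Pending;
  bool InFlight = false;
  std::mutex M;

  void scheduleNextLocked() {
    if (Pending.empty()) {
      InFlight = false; // batch is finished
      return;
    }
    InFlight = true;
    std::function<void()> Work = std::move(Pending.front());
    Pending.pop_front();
    Submit([this, Work] {
      Work(); // runs on some pool thread, never overlapping another task
      std::lock_guard<std::mutex> L(M);
      scheduleNextLocked(); // chain the next task, if any
    });
  }

public:
  explicit SerialQueueSketch(std::function<void(std::function<void()>)> S)
      : Submit(std::move(S)) {}

  void async(std::function<void()> Work) {
    std::lock_guard<std::mutex> L(M);
    Pending.push_back(std::move(Work));
    if (!InFlight)
      scheduleNextLocked(); // no batch in flight: start one
  }
};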

llvm/include/llvm/Support/TaskQueue.h
36–42

Rather than these types, I would suggest you build your task more like the following:

template <typename Callable>
std::future<void> async(Callable &&C) {
  // Because we don't have init capture to use move-only local variables that
  // are captured into a lambda, we create the promise inside an explicit callable
  // struct. We want to do as much of the wrapping in the type-specialized domain
  // (before type erasure) and then erase this into a std::function.
  struct Task {
    Callable C;
    std::promise<void> P;

    void operator()() noexcept {
      // Note, we don't catch exceptions and store them in the promise here because
      // we don't support exception handling in LLVM.
      C();
      P.set_value();
    }
  };

  Task T = {std::forward<Callable>(C)};
  std::future<void> F = T.P.get_future();
  asyncImpl(std::move(T));
  return F;
}

This is somewhat similar to what std::packaged_task does, but due to how std::packaged_task works, this will probably end up a bit lighter weight (specifically, std::packaged_task type erases the callable separately from the promise, while this packages the callable and the promise together ahead of time and then type erases the combination afterward).

The end result is that your task queue operates in a super simple closure type: std::function<void()>. This will be true even when you produce a value from the callable and store it into the promise. And this should also work even when you have no threading support at all. std::shared_future is a lot more expensive ultimately than std::future.

Does all of this make sense?

And FWIW, this is where C++14 is actually helpful, as there the above goop looks a lot cleaner:

template <typename Callable>
std::future<void> async(Callable &&C) {
  std::promise<void> P;
  std::future<void> F = P.get_future();
  asyncImpl([C = std::forward<Callable>(C), P = std::move(P)]() mutable {
      // Note, we don't catch exceptions and store them in the promise here because
      // we don't support exception handling in LLVM.
      C();
      P.set_value();
  });

  return F;
}
50–58

I regularly find std::bind to be problematic and hard to understand. Rather than expose this at all, I would force clients to handle it by passing you a lambda (or a std::bind if they so choose) that handles all of the argument binding. Then you just have a callable (the overload below).

63–64

I would term this a Callable as it may not be a function at all. The most common case I suspect will be a lambda.

I also wonder whether you want to decouple this serialized execution from the thread creation and management. Could you accept a thread pool instead, and enqueue work on the thread pool? [...]

Isn't using multiple threads strictly incompatible with the idea of it being serialized or synchronized? With multiple threads, you will have work overlapping, which confuses the purpose of what this is for. If someone comes up with a use case for it, then maybe, but otherwise, YAGNI?

I like your other suggestions (which I snipped from the quote), I'll try to make it work and report back if I can't.

One thing that's annoying is that you can't call promise.set_value(V) when V is void. Bleh.

I also wonder whether you want to decouple this serialized execution from the thread creation and management. [...]

Isn't using multiple threads strictly incompatible with the idea of it being serialized or synchronized? [...]

It isn't about *using* multiple threads, it's about selecting which threads run things.

Whether tasks are serialized in their execution is independent of how many threads they run on. It's just that there doesn't *need* to be more than one if they are serialized.

But a system thread pool might expose the equivalent of <1 thread.

Think of this example. You have three different systems, each of which wants to use this infrastructure to run a series of asynchronous tasks, but for each system the series of tasks must run in order (FIFO) and without overlap. You can re-use a single thread pool with potentially fewer than three threads for all three of these systems. This allows the client to control the rate of work -- perhaps only allowing at most two operations to happen concurrently.

While with three systems and two threads this seems somewhat odd, if you have *many* systems (maybe 100s) all using this to coordinate a series of tasks, you might not want hundreds of threads, all spinning on their own queues. Instead, it seems nicer to have a single thread pool, sized appropriately for the amount of resources on the system, and have all of the users of this class enqueue their serial sequences of tasks into this shared thread pool.

I like your other suggestions (which I snipped from the quote), I'll try to make it work and report back if I can't.

Yeah, the intent was that they're totally orthogonal.

One thing that's annoying is that you can't call promise.set_value(V) when V is void. Bleh.

OMG YES! =[ I wish I knew a better way than horrible specialization on void that has different code.
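For what it's worth, the usual shape of that workaround looks something like this (an illustrative sketch with made-up names, assuming the callable's result type is known as T): route set_value through a small helper so that only the helper is specialized on void.

#include <future>

template <typename T> struct PromiseSetter {
  template <typename Callable>
  static void apply(std::promise<T> &P, Callable &C) {
    P.set_value(C()); // non-void: store the result directly
  }
};

template <> struct PromiseSetter<void> {
  template <typename Callable>
  static void apply(std::promise<void> &P, Callable &C) {
    C(); // void: run the callable, then signal completion with no value
    P.set_value();
  }
};

Callers then write PromiseSetter<ResultT>::apply(P, C) in one place instead of branching on void everywhere.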

I also wonder whether you want to decouple this serialized execution from the thread creation and management. [...]

While with three systems and two threads this seems somewhat odd, if you have *many* systems (maybe 100s) all using this to coordinate a series of tasks, you might not want hundreds of threads, all spinning on their own queues. [...]

Would you be willing to allow that as a followup patch? I want to get the single-threaded case working with this patch, plus the extension to allow tasks that return values, and then maybe add that on in a followup. (I'm not punting on it either; I actually will do it, but as you said it seems orthogonal, so it maybe makes sense to do in a followup.)

zturner added a comment (edited). Jun 18 2018, 8:42 PM

I can't get your suggestion to work; it's very bizarre. It's claiming that I can't even *move*-construct a std::function because Task has a deleted copy constructor.

See for yourself here:

https://godbolt.org/g/x1eCnX

Any ideas?

Hmm... http://en.cppreference.com/w/cpp/utility/functional/function/function

Type requirements:
F must meet the requirements of Callable and CopyConstructible.

Yuck. Am I out of luck with this method? I could make it a shared_ptr<promise<void>>, but...
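For completeness, a sketch of that shared_ptr workaround (hypothetical helper, not proposed for the patch): holding the promise behind a shared_ptr makes the wrapper copyable, so std::function will accept it, at the cost of an allocation.

#include <functional>
#include <future>
#include <memory>
#include <utility>

template <typename Callable>
std::pair<std::function<void()>, std::future<void>>
makeCopyableTask(Callable C) {
  auto P = std::make_shared<std::promise<void>>();
  std::future<void> F = P->get_future();
  // The lambda copies the shared_ptr, not the promise itself, so the
  // resulting closure satisfies std::function's CopyConstructible requirement.
  std::function<void()> Fn = [P, C]() mutable {
    C();
    P->set_value();
  };
  return std::make_pair(std::move(Fn), std::move(F));
}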

Yeah, I'm not really able to get anything working using the suggested method of the local class. The Task class here doesn't have any logic to wait on the parent task to complete, and furthermore we just return a future from a promise without using async with deferred launch, so waiting on the future doesn't run the function. All of this is fixable, but ultimately you just end up getting pretty close to what I already have, I think.

zturner updated this revision to Diff 152013. Jun 19 2018, 7:03 PM

Updated based on suggestions from Chandler. I don't have the LLVM_ENABLE_THREADS=0 case working yet, but wanted to make sure this is generally what you had in mind.

I also removed the wait() function entirely, as it's a little more difficult to implement now. I can't just wait on the ThreadPool, because there could in theory be other work on the thread pool, so waiting on the task queue is not the same as waiting on the thread pool. I don't have an immediate need for this functionality, so unless someone else does, YAGNI.

zturner updated this revision to Diff 152018. Jun 19 2018, 9:02 PM

This gets the LLVM_ENABLE_THREADS=0 case working. It does so by taking advantage of the pointer stability of std::deque (pointers to existing elements are not invalidated by insertions at the ends), so I can save off a pointer to the newly inserted element and just run tasks until the pointers match.

Really nice approach for THREADS=0 w/ the deferred future that checks that "enough" of the queue has been processed. I like it.

If you want to avoid the address stability stuff, you could always keep an integer next to the things in the queue and track the integers...

I want to think a bit harder to understand if using the address creates an A-B-A problem. I feel like it does, but my brain is exhausted and not able to pin it down yet.

Unrelated comments inline....

llvm/include/llvm/Support/TaskQueue.h
32–33

Documentation needs updating based on the new design.

47

Assert that the queue is empty at this point?

Also, rather than waiting on the entire thread pool, maybe enqueue a no-op task into the queue and wait on its future? We don't need the underlying executor to finish, just the queue to empty.

65–68

This doesn't seem right...

Only if there are *no* tasks in flight do you want to move this into the scheduler's async. And when you do, you want to set tasks in flight to true.

89

Add a FIXME about the std::function issue? D48349 has unique_function to eventually fix this.

96

FWIW, the copy of std::function (and the eventual move of llvm::unique_function) won't be free. It may be worthwhile to keep this in a non-type-erased state by making CreateTask here a lambda within the async function above. =/ Just something to think about.

Really nice approach for THREADS=0 w/ the deferred future that checks that "enough" of the queue has been processed. I like it.

If you want to avoid the address stability stuff, you could always keep an integer next to the things in the queue and track the integers...

I want to think a bit harder to understand if using the address creates an A-B-A problem. I feel like it does, but my brain is exhausted and not able to pin it down yet.

I don't think it does, because in the LLVM_ENABLE_THREADS=0 case, there's literally only one thread in the entire application.

zturner updated this revision to Diff 152208. Jun 20 2018, 5:48 PM

Updated with suggestions from Chandler. I'm not sure why all my tests passed even with the bug that he noticed. Would be nice if there were a way to write a test for that case, but I'm not sure how. I also made the suggested cleanup to eliminate the extra cost of copying the task. This was a little annoying, because I had to go back to using the callable object instead of the lambda so that I could get move semantics for the callable. But this means I can't easily capture this, so I had to move the callable up to the class level and make it a friend, giving it access to TaskQueue's internal state. Anyway, it all works now.

Updated with suggestions from Chandler. I'm not sure why all my tests passed even with the bug that he noticed. Would be nice if there were a way to write a test for that case, but I'm not sure how.

I think what happened was that all tasks were inserted into the scheduler unconditionally. But, since that was a one-thread scheduler which happens to process tasks in sequence, everything worked as it should.

It would be interesting to have a test with a scheduler with at least two threads which verifies that the tasks are still processed serially. You could have the first task block, and then validate that the second task hasn't started.
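The shape of such a test might look like this (illustrative only; the ThreadPool and TaskQueue constructors here are assumptions, not the real API):

TEST(TaskQueueTest, SerialEvenWithTwoWorkerThreads) {
  ThreadPool TP(2);  // assumed: a scheduler with two worker threads
  TaskQueue TQ(TP);  // assumed: a queue that schedules onto that pool
  std::promise<void> Gate;
  std::atomic<bool> SecondRan(false);
  std::future<void> F1 = TQ.async([&Gate] { Gate.get_future().wait(); });
  std::future<void> F2 = TQ.async([&SecondRan] { SecondRan = true; });
  // Task 1 is blocked on Gate; if tasks were allowed to overlap, task 2
  // could be running on the second pool thread by now.
  EXPECT_FALSE(SecondRan);
  Gate.set_value();
  F2.wait();
  EXPECT_TRUE(SecondRan);
}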

llvm/include/llvm/Support/TaskQueue.h
38

Task uses different tags here (class) and in the definition (struct). Also, I think this friend declaration isn't really needed as Task should have access to TaskQueue internals anyway.

53–67

Maybe this part could be moved to a function in the TaskQueue class? It means less code would need to be duplicated between different Task instantiations, and the management of TaskQueue internal data structures would stay within that class.

132–133

Here you're using NextPtr after you have popped the element it points to off the queue. I think this is technically UB. Instead of that, you could just have a bool Done variable which you set at the beginning of the loop.

After that ABA should not be a problem since you're only ever using the ElementPtr pointer while it is live.
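A sketch of that fix (illustrative names): compute Done while the front element is still alive, instead of comparing against a pointer after the pop.

#include <deque>
#include <functional>

static void runUntil(std::deque<std::function<void()>> &Tasks,
                     const std::function<void()> *ElementPtr) {
  while (!Tasks.empty()) {
    // Inspect the element while it is still live; no dangling-pointer use.
    bool Done = &Tasks.front() == ElementPtr;
    std::function<void()> Work = std::move(Tasks.front());
    Tasks.pop_front();
    Work();
    if (Done)
      return; // the task we were waiting on has now run
  }
}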

zturner updated this revision to Diff 152427. Jun 21 2018, 8:27 PM

Updated based on more suggestions. I switched from using deque pointer stability to using explicit ids. I think this makes the code easier to reason about, especially in the followup patch which allows the tasks to return values. I also split out the LLVM_ENABLE_THREADS=1 case from the alternative. They were diverging enough that all the pre-processor branches made it difficult to follow. I think this is cleaner and more maintainable this way.

I think I noticed one more bug in the !threads implementation. Other than that, this looks fine to me, but I'll leave the final OK to Chandler.

llvm/include/llvm/Support/TaskQueue.h
145–151

I just realized this could be wrong when the futures are waited on out of order. E.g. if you have F1 and F2, F2.wait() will increment NextRunId to 2. Then if you do F1.wait(), it will loop until NextRunId wraps back to 1 and crash in the process.

Fortunately, with RunIds it should be easy to fix this by changing != to <. If we're worried about uint64_t wrapping we can do something like Id-NextRunId < UINT64_MAX/2

(Also, this code could be moved into a separate function to decrease the number of instantiations of the lambda.)
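A sketch of that wrap-safe check (hypothetical helper name):

#include <cstdint>

// True while task Id has not yet run, even if the 64-bit counter has wrapped,
// provided the two ids are less than 2^63 apart.
static inline bool isPending(uint64_t Id, uint64_t NextRunId) {
  return Id - NextRunId < UINT64_MAX / 2;
}

The wait loop would then keep running tasks while isPending(Id, NextRunId) holds, instead of comparing with !=.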

After some discussion with Chandler, we actually decided to just delete the nothreads codepath and generate a compile error if anyone tries to use this class with nothreads. I'll upload a fix that does that later today.

zturner updated this revision to Diff 152550. Jun 22 2018, 3:49 PM

This patch deletes the LLVM_ENABLE_THREADS=0 codepath and just errors out if you try to use this class in those circumstances. Since the regular codepath is actually pretty straightforward and doesn't have much complexity, I went ahead and added support for tasks that return values directly to this patch. It was only a couple of lines of change and saves me from having to make another patch for it.

This revision was not accepted when it landed; it landed in state Needs Review. Jun 24 2018, 8:17 PM
This revision was automatically updated to reflect the committed changes.
thakis added a subscriber: thakis. Jun 26 2018, 6:38 AM

This patch deletes the LLVM_ENABLE_THREADS=0 codepath and just errors out if you try to use this class in those circumstances. Since the regular codepath is actually pretty straightforward and doesn't have much complexity, I went ahead and added support for tasks that return values directly to this patch. It was only a couple of lines of change and saves me from having to make another patch for it.

Unsurprisingly this breaks the LLVM_ENABLE_THREADS=0 build:

FAILED: unittests/Support/CMakeFiles/SupportTests.dir/TaskQueueTest.cpp.o 
/b/build/slave/linux_upload_clang/build/src/third_party/llvm-bootstrap-install/bin/clang++   -DGTEST_HAS_PTHREAD=0 -DGTEST_HAS_RTTI=0 -DGTEST_HAS_TR1_TUPLE=0 -DGTEST_LANG_CXX11=1 -D_DEBUG -D_GNU_SOURCE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -Iunittests/Support -I/b/build/slave/linux_upload_clang/build/src/third_party/llvm/unittests/Support -I/usr/include/libxml2 -Iinclude -I/b/build/slave/linux_upload_clang/build/src/third_party/llvm/include -I/b/build/slave/linux_upload_clang/build/src/third_party/llvm/utils/unittest/googletest/include -I/b/build/slave/linux_upload_clang/build/src/third_party/llvm/utils/unittest/googlemock/include --gcc-toolchain=/b/build/slave/linux_upload_clang/build/src/third_party/llvm-build-tools/gcc485precise -fPIC -fvisibility-inlines-hidden -Werror=date-time -Werror=unguarded-availability-new -std=c++11 -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wcovered-switch-default -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wstring-conversion -fdiagnostics-color -ffunction-sections -fdata-sections -O3    -UNDEBUG  -Wno-variadic-macros -Wno-gnu-zero-variadic-macro-arguments -fno-exceptions -fno-rtti -MMD -MT unittests/Support/CMakeFiles/SupportTests.dir/TaskQueueTest.cpp.o -MF unittests/Support/CMakeFiles/SupportTests.dir/TaskQueueTest.cpp.o.d -o unittests/Support/CMakeFiles/SupportTests.dir/TaskQueueTest.cpp.o -c /b/build/slave/linux_upload_clang/build/src/third_party/llvm/unittests/Support/TaskQueueTest.cpp
In file included from /b/build/slave/linux_upload_clang/build/src/third_party/llvm/unittests/Support/TaskQueueTest.cpp:10:
/b/build/slave/linux_upload_clang/build/src/third_party/llvm/include/llvm/Support/TaskQueue.h:84:5: error: static_assert failed "TaskQueue requires building with LLVM_ENABLE_THREADS!"
    static_assert(false,
    ^             ~~~~~

What's the thinking here? Maybe TaskQueueTest.cpp shouldn't be part of the build if LLVM_ENABLE_THREADS=0? As-is, this just gives you a compile error if you use that option and build all targets (like our bots do).

zturner added a subscriber: labath. Jun 26 2018, 7:24 AM

This whole file is supposed to be ifdefed out in this case. Is it not? I definitely tested this case.

This whole file is supposed to be ifdefed out in this case. Is it not? I definitely tested this case.

Looking at it, it seems that the code is ifdefed out, but the #include isn't, and that's what errors. Maybe you tested it by reconfiguring an existing cmake build dir instead of using a fresh one, and were led astray by cmake's overaggressive caching of build settings?

The static_assert is in a template function though, which shouldn’t get instantiated unless you create an instance of the class. And creating that instance is ifdef’ed out. If it fixes it, just move the ifdef up, but I’m a bit confused.

Also, I tested with MSVC; I guess that could be it.

Tried moving it up in r335608.