Currently the barrier supports only 256 threads,
this does not allow to write reliable tests that use more threads.
Bump max number of threads to 1024 to support writing
good stress tests.
Also replace sched_yield() with usleep(100) on the wait path.
If we write tests that create hundreds of threads (and dozens
of tests can run in parallel), yield would consume massive
amounts of CPU time for spinning.
Depends on D106952.
static constexpr ?