When linking with libhwloc, the ORDERED EPCC test slows down on big machines (> 48 cores). Performance analysis showed that a cache thrash
was occurring and this padding helps alleviate the problem.
Also, inside the main spin-wait loop in kmp_wait_release.h, we can eliminate the references to the global shared variables by instead creating a local variable, oversubscribed and instead checking that.