This is an optimization that uses a single CPU instruction on supported
architectures (amd64 and aarch64, but possibly others) to replace what was
previously an iterative look-up-table algorithm.
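For context, a minimal sketch of the general idea, with placeholder names and an assumed purpose (finding the index of the highest set bit); it is illustrative only and not the exact code touched by this patch:

```cpp
#include <cstdint>

// Illustrative iterative look-up-table approach (placeholder, not the patch's code):
// walk the value down a nibble at a time and finish with a small table.
inline unsigned msb_index_table(uint64_t x) {
  static const unsigned char table[16] = {0, 1, 2, 2, 3, 3, 3, 3,
                                          4, 4, 4, 4, 4, 4, 4, 4};
  unsigned result = 0;
  while (x >= 16) {
    x >>= 4;
    result += 4;
  }
  return result + table[x] - 1;  // x assumed nonzero
}

// Single-instruction approach: __builtin_clzll lowers to BSR/LZCNT on amd64
// and CLZ on aarch64 (x must be nonzero, since clz of 0 is undefined).
inline unsigned msb_index_clz(uint64_t x) {
  return 63u - static_cast<unsigned>(__builtin_clzll(x));
}
```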
Originally I suggested using inline assembly for this in
https://reviews.llvm.org/D117584.
__builtin_clzl is just extended to either __builtin_clz or __builtin_clzll depending on the size of unsigned long. Would it be better to instead keep the uint32_t and uint64_t specializations and directly use __builtin_clz for uint32_t and __builtin_clzll for uint64_t, together with static_assert(sizeof(unsigned int) == 4) and static_assert(sizeof(unsigned long long) == 8)?
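A rough sketch of the two alternatives being discussed; function names are placeholders rather than the actual symbols in the patch:

```cpp
#include <cstdint>

// Alternative 1: a single overload on unsigned long; __builtin_clzl maps to the
// right instruction whether unsigned long is 32- or 64-bit on the target.
inline int leading_zeros(unsigned long x) {
  return __builtin_clzl(x);
}

// Alternative 2: keep explicit fixed-width specializations and pin down the
// builtin argument-width assumptions with static_asserts.
inline int leading_zeros32(uint32_t x) {
  static_assert(sizeof(unsigned int) == 4,
                "__builtin_clz operates on a 32-bit unsigned int");
  return __builtin_clz(x);
}

inline int leading_zeros64(uint64_t x) {
  static_assert(sizeof(unsigned long long) == 8,
                "__builtin_clzll operates on a 64-bit unsigned long long");
  return __builtin_clzll(x);
}
```

With the second form, any mismatch between the fixed-width types and the builtin argument widths fails at compile time instead of silently changing behavior on an unusual target.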