We only need to stay at 512 bits all the way through for the v8i64 reductions, since those max instructions are new in AVX512F and are only available at 512 bits until SKX.
For v16i32 and floating point, we have legacy 128/256-bit instructions we can use.
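To make the narrowing shape concrete, here is a rough scalar model of a v16i32 signed-max reduction that halves the working width at each step. It is only a sketch: `fold_upper_half_max` and `reduce_max_v16i32_model` are hypothetical names for illustration, not the intrinsics in the header, and plain arrays stand in for the vector registers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of one reduction step: fold the upper half of the first
   `width` lanes onto the lower half using a signed max. On hardware this
   would be an extract (or shuffle) followed by a max at the narrower width. */
static void fold_upper_half_max(int32_t *lanes, int width) {
    assert(width >= 2 && width % 2 == 0);
    int half = width / 2;
    for (int i = 0; i < half; ++i) {
        if (lanes[i + half] > lanes[i])
            lanes[i] = lanes[i + half];
    }
}

/* Hypothetical helper (an assumption for illustration, not the real
   intrinsic): models a v16i32 signed-max reduction that narrows as it goes. */
static int32_t reduce_max_v16i32_model(const int32_t in[16]) {
    int32_t lanes[16];
    memcpy(lanes, in, sizeof lanes);
    fold_upper_half_max(lanes, 16); /* 512 -> 256 bits: 8 live lanes  */
    fold_upper_half_max(lanes, 8);  /* 256 -> 128 bits: 4 live lanes  */
    fold_upper_half_max(lanes, 4);  /* within 128 bits: 2 live lanes  */
    fold_upper_half_max(lanes, 2);  /* within 128 bits: 1 live lane   */
    return lanes[0];
}
```

After the first two steps only four lanes are live, which is why the remaining work fits in a 128-bit register for the element types that have legacy instructions.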
I've tried to use other intrinsics to reduce the verbosity of the code and avoid having to mention all the shuffles. I've also removed all the -1 shuffle indices so the output sequence is fully specified and not left to backend optimization.
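For the v8i64 case that stays at full width, a fully specified sequence might look like the scalar model below. This is a sketch under assumptions: the index patterns are one plausible choice and not necessarily the ones in the patch, and `shuffle8`, `max8`, and `reduce_max_v8i64_model` are hypothetical names. The point is that every shuffle names a source lane for all eight outputs, so nothing is left undefined.

```c
#include <stdint.h>

/* Scalar stand-in for an 8-lane shuffle in which every output lane names
   an explicit source lane (no "don't care" indices). */
static void shuffle8(const int64_t src[8], const int idx[8], int64_t dst[8]) {
    for (int i = 0; i < 8; ++i)
        dst[i] = src[idx[i]];
}

/* Lane-wise signed max at full width, accumulating into `acc`. */
static void max8(int64_t acc[8], const int64_t other[8]) {
    for (int i = 0; i < 8; ++i) {
        if (other[i] > acc[i])
            acc[i] = other[i];
    }
}

/* Hypothetical helper (an assumption for illustration): models a v8i64
   signed-max reduction where every step operates on all 8 lanes. */
static int64_t reduce_max_v8i64_model(const int64_t in[8]) {
    /* Assumed index patterns; each lists all eight source lanes. */
    static const int swap_halves[8]   = {4, 5, 6, 7, 0, 1, 2, 3};
    static const int swap_pairs[8]    = {2, 3, 0, 1, 6, 7, 4, 5};
    static const int swap_adjacent[8] = {1, 0, 3, 2, 5, 4, 7, 6};
    int64_t acc[8], tmp[8];
    for (int i = 0; i < 8; ++i)
        acc[i] = in[i];
    shuffle8(acc, swap_halves, tmp);   max8(acc, tmp);
    shuffle8(acc, swap_pairs, tmp);    max8(acc, tmp);
    shuffle8(acc, swap_adjacent, tmp); max8(acc, tmp);
    return acc[0]; /* every lane now holds the maximum */
}
```

With these patterns each lane ends up holding the overall maximum, so the final extract of lane 0 does not depend on anything the backend is free to choose.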
Would it be dumb to allow VLX-capable CPUs to use the 128/256-bit variants of VPMAXUQ etc.? Or is it better to focus on improving SimplifyDemandedElts to handle this (and the many other reduction cases that all tend to stay at the original vector width)?