The load ports need a cycle for each potentially loaded element, just like Haswell and Skylake. Unlike Haswell and Broadwell, the number of uops does not scale with the number of elements. Instead, the load uops run for multiple cycles.
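To illustrate the difference between the two modelling styles, here is a minimal Python sketch. It is not the TableGen from this patch. The names `GatherCost`, `uops_scale_with_elements` and `cycles_scale_with_elements`, and the uop counts used below, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GatherCost:
    num_uops: int          # total micro-ops charged to the instruction
    load_port_cycles: int  # total cycles reserved on the load port group


def uops_scale_with_elements(num_elements: int, non_load_uops: int) -> GatherCost:
    """One single-cycle load uop per element, plus the non-load uops."""
    return GatherCost(
        num_uops=non_load_uops + num_elements,
        load_port_cycles=num_elements,
    )


def cycles_scale_with_elements(num_elements: int, total_uops: int) -> GatherCost:
    """Fixed uop count; the load uops hold the load ports one cycle per element."""
    return GatherCost(
        num_uops=total_uops,
        load_port_cycles=num_elements,
    )


if __name__ == "__main__":
    # Hypothetical uop counts, not taken from uops.info or IACA.
    for n in (2, 4, 8, 16):
        print(n, uops_scale_with_elements(n, 3), cycles_scale_with_elements(n, 5))
```

In both styles the load ports are charged one cycle per element. Only the second keeps the uop count constant as the element count grows.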
I've taken the latency numbers from uops.info. The port bindings for the non-load uops are taken from the original IACA data I have.
I've added the AVX-512 gather instructions to the llvm-mca resource tests. I wanted to pre-commit them, but some of them have 0 uops in the existing data, so llvm-mca gave an error.