As reported on PR51539, the scheduler model appears to report implausibly high throughput costs for codegen involving vpshufb/vpperm.
e.g. ctpop: https://c.godbolt.org/z/4hcaMqPzd
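For anyone without the link handy, a minimal sketch of the kind of source that can end up as a vpshufb-based ctpop sequence is below. It is not a copy of the linked reproducer: the function name `popcount_bytes` is made up for illustration, and whether the loop is actually vectorized into a vpshufb nibble-lookup depends on the compiler version and flags (something like `-O3 -march=bdver2` is assumed here).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reproducer: per-byte population count over a buffer.
 * With vectorization enabled on an AVX/XOP target, the byte-wise ctpop
 * may be lowered to a nibble lookup table implemented with vpshufb. */
void popcount_bytes(const uint8_t *src, uint8_t *dst, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        /* __builtin_popcount is the GCC/Clang builtin for ctpop on unsigned int */
        dst[i] = (uint8_t)__builtin_popcount(src[i]);
    }
}
```

The generated assembly can then be fed to llvm-mca with `-mcpu=bdver2` to compare the reported throughput against the references below.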
According to the AMD Family 15h Software Optimization Guide, these are FastPath instructions (tp = 1.0) that issue only on pipe1 (xbr). Agner's instruction tables and InstLatX64 also agree that both latency and throughput are better than what the model reports.
AMD (https://www.amd.com/system/files/TechDocs/47414_15h_sw_opt_guide.pdf)
Agner (https://agner.org/optimize/instruction_tables.pdf)
InstLatX64 (http://users.atw.hu/instlatx64/AuthenticAMD/AuthenticAMD0610F01_K15_Piledriver_InstLatX64.txt)
I think most other shuffles should probably be using xbr as well?