(Note, this extends D91398 and probably won't make any sense unless you've looked at that first.)
When scalarizing a uniform expression, we currently only consider uniformity within a single vectorization factor (VF), i.e. across the lanes of one unrolled part. For some expressions, we can exploit the fact that the expression is uniform across all lanes of all UF unrolled parts. This patch teaches VPReplicateRecipe how to do this.
After this patch (and the previous one), we can lower a load from a loop invariant address as a single scalar load. (Instead of emitting UF*VF scalar loads and relying on CSE to clean them up later.)
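For illustration, here is a minimal sketch of the kind of source loop this affects. The function and parameter names are hypothetical and not taken from the patch or its tests. The assumption is that the load of *scale stays inside the loop because out may alias scale, so it cannot simply be hoisted.

```cpp
#include <cstddef>

// Hypothetical example, not from the patch: *scale is read from a
// loop invariant address, so every lane of every unrolled part sees
// the same value within one vector iteration.
void scale_all(int *out, const int *in, const int *scale, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    // The address 'scale' does not depend on i. The stores to out[i]
    // may alias *scale (no restrict), which is assumed to keep the
    // load in the loop body.
    out[i] = in[i] * *scale;
  }
}
```

With VF=4 and UF=2, for example, per-lane replication would produce 8 scalar loads of *scale per vector iteration; the lowering described above would produce one.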
I'd hoped to exercise this through code paths not involving uniform mem ops, but the cases I tried were mostly covered by existing scalarization logic. I believe this will sometimes trigger with existing code, but I have struggled to find a clean example, so I made the patch dependent on the uniform memory op work.
Precommit this?