The comment says this was stopped because it was unlikely to be
profitable. This is not true if you want to combine vector loads
with multiple components.
For a simple case that looks like

  t0 = load t0 ...
  t1 = load t0 ...
  t2 = load t0 ...
  t3 = load t0 ...
  t4 = store t0:1, t0:1
  t5 = store t4, t1:0
  t6 = store t5, t2:0
  t7 = store t6, t3:0

We want to get all of these stores onto a chain
that is a TokenFactor of these N loads. This mostly
solves the AMDGPU merge-stores.ll regressions
with -combiner-alias-analysis for merging vector
stores of vector loads.
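
As a rough illustration of the rechaining described above, here is a minimal sketch on a toy graph. It is not LLVM's SelectionDAG API: `Node`, `Kind`, `rechainStoresOntoLoads`, and `buildExample` are hypothetical names invented for this example, and the sketch assumes the stores are already known not to alias each other or the loads, so it performs no alias checks at all.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy stand-in for a DAG node. Hypothetical, NOT LLVM's SDNode.
enum class Kind { Entry, Load, Store, TokenFactor };

struct Node {
  Kind kind;
  std::vector<int> chains; // indices of the nodes this node is chained after
  int value;               // for stores: index of the load being stored, else -1
};

// Walk up the linear chain of stores ending at lastStore, collect the loads
// whose values they store, then chain every store onto one TokenFactor of
// those loads. Assumes (without checking) that none of the accesses alias.
int rechainStoresOntoLoads(std::vector<Node> &dag, int lastStore) {
  assert(dag[lastStore].kind == Kind::Store);
  std::vector<int> stores, loads;
  int cur = lastStore;
  while (dag[cur].kind == Kind::Store) {
    assert(!dag[cur].chains.empty());
    stores.push_back(cur);
    loads.push_back(dag[cur].value);
    cur = dag[cur].chains.front(); // step to the previous link in the chain
  }
  // Deduplicate the loads so each appears once as a TokenFactor operand.
  std::sort(loads.begin(), loads.end());
  loads.erase(std::unique(loads.begin(), loads.end()), loads.end());

  dag.push_back({Kind::TokenFactor, loads, -1});
  int tf = static_cast<int>(dag.size()) - 1;
  for (int s : stores)
    dag[s].chains = {tf}; // every store now depends only on the N loads
  return tf;
}

// Builds the shape from the description: four loads off the entry token,
// then four stores serialized one after another.
std::vector<Node> buildExample() {
  std::vector<Node> dag;
  dag.push_back({Kind::Entry, {}, -1});   // node 0: entry token
  for (int i = 0; i < 4; ++i)
    dag.push_back({Kind::Load, {0}, -1}); // nodes 1..4: loads
  dag.push_back({Kind::Store, {1}, 1});   // node 5: chained to the first load
  dag.push_back({Kind::Store, {5}, 2});   // node 6: chained to store 5
  dag.push_back({Kind::Store, {6}, 3});   // node 7: chained to store 6
  dag.push_back({Kind::Store, {7}, 4});   // node 8: chained to store 7
  return dag;
}
```

Calling `rechainStoresOntoLoads(dag, 8)` on the graph from `buildExample()` leaves all four stores as siblings under a single TokenFactor, which is the shape that lets a later merge see them together rather than as a serialized sequence.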
It would be nice, IMHO, to get rid of these hard-coded depth limits and turn them into cl::opts. This is a general comment, as I feel the same way about all of these depth limits everywhere. [I'm not requesting that you change this here].