This change tries to load fewer vector elements, based on the demanded elements.
This might not improve code generation quality for targets that eventually
scalarize the load, but it helps a lot for targets like AMDGPU, which map the
load to a native vector load.
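As a rough illustration of what "demanded elements" means here, the sketch below computes a per-element demanded mask and the number of leading elements that would still need to be loaded. It is a standalone model, not the code in this change: `demandedMask` and `narrowedElementCount` are hypothetical helpers, and it assumes the load is narrowed to a prefix of the original vector (highest demanded index + 1).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper, not an LLVM API: build a bitmask with one bit set for
// each vector element that some user actually reads.
uint32_t demandedMask(const std::vector<unsigned> &usedIndices) {
  uint32_t mask = 0;
  for (unsigned idx : usedIndices) {
    assert(idx < 32 && "this sketch supports at most 32 elements");
    mask |= (1u << idx);
  }
  return mask;
}

// Hypothetical helper: number of leading elements that must still be loaded,
// assuming the narrowed load keeps a prefix of the original vector.
// Returns 0 when no element is demanded.
unsigned narrowedElementCount(uint32_t mask) {
  unsigned count = 0;
  while (mask != 0) {
    ++count;
    mask >>= 1;
  }
  return count;
}
```

For the motivating case, where only elements 0, 1 and 2 of a four-element load are used, `narrowedElementCount(demandedMask({0, 1, 2}))` yields 3, so under this assumption a three-element load would suffice.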
The motivating case for the change is the following pattern, which we observe
in compute workloads:
  %a = load <i32 x 4>
  %b = load <i32 x 4>
  use(%a.012)
Because the last element of %a is not used, the register allocator reuses the
physical register of the unused element, which causes an unnecessary
s_waitcnt to be inserted between the two loads:
  $v0_v1_v2_v3 = load <i32 x 4>
  s_waitcnt
  $v3_v4_v5_v6 = load <i32 x 4>
This change helps avoid such cases in the backend, and in general it
should also help reduce memory traffic.
Should the original load be replaced rather than kept around?