This is an archive of the discontinued LLVM Phabricator instance.

InstCombine: Simplify vector load based on demanded elements
Abandoned · Public

Authored by ruiling on Nov 16 2022, 10:44 PM.

Details

Summary

The change narrows vector loads based on their demanded elements, so fewer
elements are loaded. This might not help code generation quality for targets
that ultimately scalarize the load, but it helps a lot for targets like AMDGPU
that map such loads to native vector loads.

The motivating case for the change is the following pattern, observed in a
compute workload:

%a = load <4 x i32>
%b = load <4 x i32>

use(%a.012)  ; only elements 0, 1 and 2 of %a are used

Because the last element of %a is not used, the register allocator reuses that
physical register for the second load, which causes an unnecessary s_waitcnt
to be inserted between the two loads:

$v0_v1_v2_v3 = load <4 x i32>
s_waitcnt
$v3_v4_v5_v6 = load <4 x i32>

The change here would help the backend avoid such cases, and in general it
should also reduce memory traffic.
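
For illustration, here is a minimal IR sketch of the intended rewrite (the
pointer %p, the value names and the alignment are invented here; the exact
form produced by the patch may differ):

; Before: a full <4 x i32> load, but lane 3 is never used.
%a = load <4 x i32>, ptr %p, align 16
%a0 = extractelement <4 x i32> %a, i64 0
%a1 = extractelement <4 x i32> %a, i64 1
%a2 = extractelement <4 x i32> %a, i64 2

; After: only the three demanded lanes are loaded; the dropped lane is
; rebuilt as poison so existing users still see a <4 x i32> value.
%a.narrow = load <3 x i32>, ptr %p, align 16
%a.wide = shufflevector <3 x i32> %a.narrow, <3 x i32> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 poison>
; uses of %a are rewritten to use %a.wide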

Diff Detail

Event Timeline

ruiling created this revision.Nov 16 2022, 10:44 PM
Herald added a project: Restricted Project. · View Herald TranscriptNov 16 2022, 10:44 PM
ruiling requested review of this revision.Nov 16 2022, 10:44 PM
piotr added inline comments.Nov 17 2022, 12:39 AM
llvm/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp
1716

Should the original load be replaced rather than kept around?

I've always thought we didn't have such a transformation in the generic instcombine, because it may hurt some targets, but I don't immediately see how. (We do have a similar one for intrinsic loads in the AMDGPU-specific file).

> I've always thought we didn't have such a transformation in the generic instcombine, because it may hurt some targets, but I don't immediately see how. (We do have a similar one for intrinsic loads in the AMDGPU-specific file).

For X86, the vector load will be scalarized, so it is easy to optimize during instruction selection or later on MachineIR. I guess this may contribute to why we were not doing such an optimization before. I cannot see why this would hurt other targets.

llvm/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp
1716

The original load becomes dead and will be deleted afterwards in this pass.

spatel requested changes to this revision.Nov 17 2022, 6:18 AM

> I've always thought we didn't have such a transformation in the generic instcombine, because it may hurt some targets, but I don't immediately see how. (We do have a similar one for intrinsic loads in the AMDGPU-specific file).

This is correct - this is not universally beneficial/recoverable.

Looking at the diffs here, for example, we can see that a <4 x float> load (legal on x86) is replaced by a <3 x float> load (not legal on x86). There is no way to recover the wider load at any later point in optimization because nothing in the IR says it is safe to load the extra bytes. A single 128-bit load reduced to a 96-bit load could then have to be split into a 64-bit load and a 32-bit load, and that's likely worse for perf.
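
A rough IR illustration of that concern (the pointer %p and the value names
are invented; the diffs in this patch show the same <4 x float> to <3 x float>
shape):

; Legal 128-bit load on x86, typically a single vector-load instruction.
%v = load <4 x float>, ptr %p, align 16

; Narrowed 96-bit load; nothing records that the fourth element is still
; dereferenceable, so this may end up split into a 64-bit and a 32-bit load.
%v.narrow = load <3 x float>, ptr %p, align 16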

We've been trying to solve bugs like that (the source is written in a way that is hard to optimize) for a long time, so this would make things worse. Example:
https://github.com/llvm/llvm-project/issues/17113

I recommend trying this in VectorCombine or AggressiveInstCombine with TTI legality checks (assuming it needs to happen early in IR to unlock other transforms).

This revision now requires changes to proceed.Nov 17 2022, 6:18 AM

Agreed - as it is, this patch puts even more strain on the DAG to combine loads back to legality.

We don't even seem to have a way to track the original dereferenceable ranges :(

We also wouldn't want to do this if the load is a scalar load candidate, since there are no 96-bit scalar loads.

ruiling abandoned this revision.Dec 4 2022, 11:50 PM

I think the best solution for the motivating problem is in the backend, where the load instructions have reached their final form, so it could help more cases. A load in LLVM IR is still subject to being either split or combined.

arsenm added a comment.Dec 5 2022, 5:55 AM

> I think the best solution for the motivating problem is in the backend, where the load instructions have reached their final form, so it could help more cases. A load in LLVM IR is still subject to being either split or combined.

I think the opposite. The backend load passes have to deal with way more patterns and addressing modes.

>> I think the best solution for the motivating problem is in the backend, where the load instructions have reached their final form, so it could help more cases. A load in LLVM IR is still subject to being either split or combined.
>
> I think the opposite. The backend load passes have to deal with way more patterns and addressing modes.

It could be done in the LLVM IR part of the code generation pipeline?

>>> I think the best solution for the motivating problem is in the backend, where the load instructions have reached their final form, so it could help more cases. A load in LLVM IR is still subject to being either split or combined.
>>
>> I think the opposite. The backend load passes have to deal with way more patterns and addressing modes.
>
> It could be done in the LLVM IR part of the code generation pipeline?

Don't we already do that in VectorCombine?

> Don't we already do that in VectorCombine?

We have load scalarization, but there's no subvector-narrowing kind of transform AFAIK:
https://github.com/llvm/llvm-project/blob/ee31a4a7029f2f6fda5f416e7eb67ca3907d9e36/llvm/lib/Transforms/Vectorize/VectorCombine.cpp#L1118
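
For reference, the existing load scalarization there (scalarizeLoadExtract)
handles a vector load whose only uses are extractelements, roughly like this
(the pointer %p and the value names are invented for illustration):

; Before:
%v = load <4 x i32>, ptr %p, align 16
%e = extractelement <4 x i32> %v, i64 2

; After: a scalar load of just the extracted element.
%e.ptr = getelementptr inbounds i32, ptr %p, i64 2
%e.scalar = load i32, ptr %e.ptr, align 4

That is different from the subvector narrowing discussed in this review, e.g.
turning a <4 x i32> load with three demanded lanes into a <3 x i32> load.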