This is an archive of the discontinued LLVM Phabricator instance.

[AMDGCN] Run LoadStoreVectorizer before CodeGenPrepare
AbandonedPublic

Authored by rampitec on Apr 9 2020, 2:05 PM.

Details

Reviewers
arsenm
yaxunl
Summary

AMDGPUCodeGenPrepare widens some loads which then prevent
vectorization of an otherwise vectorizable load pair.

Run vectorizer pass before AMDGPUCodeGenPrepare to catch
the opportunity. The second run is still in place as passes
in between also create a lot of vectorization opportunities.

Diff Detail

Event Timeline

rampitec created this revision.Apr 9 2020, 2:05 PM
arsenm added inline comments.Apr 9 2020, 2:13 PM
llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
742

This pass is already problematic for compile time and hits quadratic behavior, so I'm not thrilled about running it twice. Why can't we just run it once?

arsenm added inline comments.Apr 9 2020, 2:24 PM
llvm/test/CodeGen/AMDGPU/vectorize-loads.ll
18

If the problem is just the widened constant loads, we could maybe handle this directly in the vectorizer.

rampitec marked 2 inline comments as done.Apr 9 2020, 2:27 PM
rampitec added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
742

There are 111 regressions in check-llvm-codegen-amdgpu if I just move it. All that I have checked are real.

llvm/test/CodeGen/AMDGPU/vectorize-loads.ll
18

Looks like vectorizer would need to be much smarter than it is now to do so.

arsenm added inline comments.Apr 9 2020, 2:27 PM
llvm/test/CodeGen/AMDGPU/vectorize-loads.ll
18

We could also do the widening later

rampitec marked an inline comment as done.Apr 9 2020, 2:48 PM
rampitec added inline comments.
llvm/test/CodeGen/AMDGPU/vectorize-loads.ll
18

Split AMDGPUCodeGenPrepare to early and late? It should be possible. If I use -amdgpu-codegenprepare-widen-constant-loads=0 this test is vectorized, so the only known to me problem is widening.

arsenm added inline comments.Apr 9 2020, 2:51 PM
llvm/test/CodeGen/AMDGPU/vectorize-loads.ll
18

We don't really need to do the load widening in the IR. We should be able to do in the DAG/gMIR level

In fact we do it on a DAG level, look at t14 transformed into t18:

Initial selection DAG: %bb.0 'test:'
SelectionDAG has 18 nodes:
  t0: ch = EntryToken
  t3: i64 = Constant<0>
    t2: i64,ch = CopyFromReg t0, Register:i64 %3
  t5: i64,ch = load<(dereferenceable invariant load 8, align 16, addrspace 4)> t0, t2, undef:i64
      t6: i64,ch = merge_values t5, t5:1
            t11: i64 = llvm.amdgcn.dispatch.ptr TargetConstant:i64<1116>
          t13: i64 = add nuw t11, Constant:i64<4>
        t14: i16,ch = load<(load 2 from %ir.4, align 4, !tbaa !10, addrspace 4)> t0, t13, undef:i64
      t15: i32 = zero_extend t14
        t8: i64 = llvm.amdgcn.kernarg.segment.ptr TargetConstant:i64<1636>
      t9: i64,ch = load<(dereferenceable invariant load 8 from %ir..kernarg.offset.cast, align 16, addrspace 4)> t0, t8, undef:i64
    t16: ch = store<(store 4 into %ir..load, !tbaa !19, addrspace 1)> t6:1, t15, t9, undef:i64
  t17: ch = ENDPGM t16


Optimized lowered selection DAG: %bb.0 'test:'
SelectionDAG has 12 nodes:
  t0: ch = EntryToken
          t11: i64 = llvm.amdgcn.dispatch.ptr TargetConstant:i64<1116>
        t13: i64 = add nuw t11, Constant:i64<4>
      t18: i32,ch = load<(load 2 from %ir.4, align 4, !tbaa !10, addrspace 4), zext from i16> t0, t13, undef:i64
        t8: i64 = llvm.amdgcn.kernarg.segment.ptr TargetConstant:i64<1636>
      t9: i64,ch = load<(dereferenceable invariant load 8 from %ir..kernarg.offset.cast, align 16, addrspace 4)> t0, t8, undef:i64
    t19: ch = store<(store 4 into %ir..load, !tbaa !19, addrspace 1)> t0, t18, t9, undef:i64
  t17: ch = ENDPGM t19

IR widening was done by your commit https://reviews.llvm.org/rG90083d3088ae, there you mention "the very weak load merging the DAG does".
However, when I turned off the option -amdgpu-codegenprepare-widen-constant-loads the only test which has failed was widen_extending_scalar_loads.ll for obvious reason.

Why don't I just turn off this option? It does not seem to do anything good anymore.

An alternative: D77835

rampitec abandoned this revision.Apr 10 2020, 8:02 AM