Page MenuHomePhabricator

[AMDGPU][CodeGen] To improve CGEMM performance: combine LDS reads.
ClosedPublic

Authored by alex-t on Oct 25 2016, 8:27 AM.

Details

Summary

Change explores the fact that LDS reads may be reordered even if access the same location.

Prior the change, algorithm immediately stops as soon as any memory access encountered between loads that are expected to be merged together. Although, Read-After-Read conflict cannot affect execution correctness.

Improves hcBLAS CGEMM manually loop-unrolled kernels performance by 44%. Also improvement expected on any massive sequences of reads from LDS.

Diff Detail

Repository
rL LLVM

Event Timeline

alex-t updated this revision to Diff 75706.Oct 25 2016, 8:27 AM
alex-t retitled this revision from to [AMDGPU][CodeGen] To improve CGEMM performance: combine LDS reads..
alex-t updated this object.
alex-t set the repository for this revision to rL LLVM.
alex-t added a subscriber: Restricted Project.

These 2 files are here to illustrate the ISA before and after the fix. See BB10 to view the effect.

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
155–156 ↗(On Diff #75706)

I see this logic repeated in a few places, can we turn it into a helper function?

218–222 ↗(On Diff #75706)

Need to use c++ style comments.

281–282 ↗(On Diff #75706)

This doesn't seem correct. The comment mentions that we need to check that it was safe to move all the instruction in InstsToMove down past this instruction. Is this condition being met?

test/CodeGen/AMDGPU/cgemm_loopunroll_ds_combine.ll
1 ↗(On Diff #75706)

This test case is too big, can you try to reduce it. Also you should run use opt -metarenamer to simplify the names. If you can't get a reduced IR test case, I would recommend writing an MIR test case.

rampitec edited edge metadata.Oct 25 2016, 11:25 AM

This is good as a fast workaround. In a long run direct selection of a vector operation from DAG seems more desirable.

alex-t added inline comments.Oct 25 2016, 12:46 PM
lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
281–282 ↗(On Diff #75706)

Yes. If this 2 mem accesses may alias but neither is write access it is safe to reorder them. In this condition we check if 2 mem accesses may alias and if at least one may write we break out. Otherwise we are free to search down until the BB ends or we face the instruction that cannot be moved.

alex-t added inline comments.Oct 26 2016, 6:39 AM
lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
281–282 ↗(On Diff #75706)

and yes, you are rigth.
I need to check if I users collected so far can be moved across MBBI before I go ahead looking for the merge point.

alex-t updated this revision to Diff 75914.Oct 26 2016, 10:02 AM
alex-t edited edge metadata.
alex-t removed rL LLVM as the repository for this revision.

Fixed according the reviewers comments.
Test case reducing in progress. Not trivial because smaller cases use less registers.
We need large unroll to expose the problem.

alex-t marked 4 inline comments as done.Oct 26 2016, 10:03 AM

Changed source code. New test is upcoming.

Fixed according the reviewers comments.
Test case reducing in progress. Not trivial because smaller cases use less registers.

If it's not trivial you may want to look at writing a MIR testcase. This will be easier and also be a better test. For examples look in test/CodeGen/MIR/

We need large unroll to expose the problem.

arsenm added inline comments.Oct 26 2016, 4:18 PM
lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
149 ↗(On Diff #75914)

We don't actually have AA enabled in the backend. We also need to add an address space alias analysis pass. If these were done, would that avoid the need to have this looser check?

alex-t updated this revision to Diff 76023.Oct 27 2016, 7:27 AM
alex-t edited edge metadata.

Small test case.

alex-t added inline comments.Oct 27 2016, 7:34 AM
lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
149 ↗(On Diff #75914)

No. Even 2 reads from same segment (address space) and exactly same address are legal to be reordered as they both have no side effects. This is Read-After-Read (RAR) conflict. The check above is necessary to allow reordering Read-After-Write and Write-After-Write in case we can prove that 2 memory operations accesses different locations.
As for the address space layer in AA, I added such in HSAIL backend. It is trivial and I can port it to LC quickly if you feel it is necessary.

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
149 ↗(On Diff #75914)

The alias analysis passes are always run so the information is available. There is a usesAA() subtarget query, but that's only used by a few places, mostly in the DAGCobminer.

alex-t marked 2 inline comments as done.Oct 27 2016, 8:18 AM
alex-t added inline comments.
lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
149 ↗(On Diff #75914)

As far as I understand Matt meant that we'd like to have custom AA layer that intercepts alias queries with 2 different address spaces. Given that:

  • flat can alias any
  • constant can alias nothing
  • global group and private only can alias pointer in same address space
rampitec added inline comments.Oct 27 2016, 1:50 PM
lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
153 ↗(On Diff #76023)

As far as I understand that is only legal to reorder two instructions if there are no stores in between of them which might go into the memory of instruction being moved. Also you need to make sure there are no barriers or fences in between. This function only checks two instructions passed but not anything in between. I.e. that is only legal if these two instructions are adjacent. I cannot immediately see it is only called for adjacent instructions, neither see a comment on such limitation.

tony-tye added inline comments.
lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
149 ↗(On Diff #75914)

Would be good to consider @jlebar patches to support target-specific AA (D24441 and D12414) which add an NVPTX-specific AA.

alex-t added inline comments.Oct 28 2016, 5:11 AM
lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
153 ↗(On Diff #76023)

In fact all the checks is done in a loop within findMatchingDSInst

if (MBBI->hasUnmodeledSideEffects())
    // We can't re-order this instruction with respect to other memory
    // opeations, so we fail both conditions mentioned above.
    return E;

  if (MBBI->mayLoadOrStore() &&
    !memAccessesCanBeReordered(*I, *MBBI, TII, AA)) {
    // We fail condition #1, but we may still be able to satisfy condition
    // #2.  Add this instruction to the move list and then we will check
    // if condition #2 holds once we have selected the matching instruction.
    InstsToMove.push_back(&*MBBI);
    addDefsToList(*MBBI, DefsToMove);
    continue;
  }

So any writes in between are in InstsToMove list and will be checked in canMoveInstsAcrossMemOp. Any instructions in between with side effects like barriers etc breaks search.
I had no intention to make a neat fix - just temporary workaround. So I preserved basic logic.
Indeed there is no need to put potential stores in the move list for later check. We can break out immediately faced mayAlias store in between.
If you insist I can change to this.

rampitec added inline comments.Oct 28 2016, 2:25 PM
lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
153 ↗(On Diff #76023)

Thanks. LGTM as a w/a provided tests are passed.

arsenm added inline comments.Oct 28 2016, 3:16 PM
test/CodeGen/AMDGPU/ds_read2.ll
496–498 ↗(On Diff #76023)

You should run instnamer on the test. Also positive checks are more useful

alex-t updated this revision to Diff 76407.Oct 31 2016, 7:56 AM

Test case passed through the opt -instrenamer

alex-t marked an inline comment as done.Oct 31 2016, 8:00 AM
alex-t added inline comments.
test/CodeGen/AMDGPU/ds_read2.ll
496–498 ↗(On Diff #76023)

The goal of the test is to check that all LDS reads are combined.
So, any "ds_read_b32" in the output is the regression. In the original CGEMM inner loop there were 64 ds_read_b32 and all of them are combined with the patch. The test case is same but shorter excerpt. That's why it is negative.

This revision was automatically updated to reflect the committed changes.
alex-t marked an inline comment as done.