
AMDGPU/SILoadStoreOptimizer: Optimize scanning for mergeable instructions
ClosedPublic

Authored by tstellar on Aug 8 2019, 10:39 AM.

Details

Summary

This adds a pre-pass to this optimization that scans through the basic
block and generates lists of mergeable instructions with one list per unique
address.

In the optimization phase, instead of scanning through the basic block for
mergeable instructions, we now iterate over the lists generated by the pre-pass.

The decision to re-optimize a block is now made per list, so if we fail to merge any
instructions with the same address, then we do not attempt to optimize them in
future passes over the block. This will help to reduce the time this pass
spends re-optimizing instructions.

In one pathological test case, this change reduces the time spent in the
SILoadStoreOptimizer from 0.2s to 0.03s.

This restructuring will also make it possible to implement further improvements
in this pass, because we can now add less expensive checks to the pre-pass and
filter instructions out early, which avoids the need to do the expensive
scanning during the optimization pass. For example, checking for adjacent
offsets is an inexpensive test we can move to the pre-pass.

Diff Detail

Repository
rL LLVM

Event Timeline

tstellar created this revision.Aug 8 2019, 10:39 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 8 2019, 10:39 AM
arsenm added inline comments.Sep 5 2019, 11:49 AM
llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
181 ↗(On Diff #214187)

Typo s/on/no

1589 ↗(On Diff #214187)

Why std::list, and a std::list of lists?

tstellar marked an inline comment as done.Sep 13 2019, 5:38 PM
tstellar added inline comments.
llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
1589 ↗(On Diff #214187)

The main reason to use lists is so I can remove items without invalidating iterators.

arsenm accepted this revision.Sep 19 2019, 5:05 PM

LGTM but I would rather avoid the list usage

This revision is now accepted and ready to land.Sep 19 2019, 5:05 PM

LGTM but I would rather avoid the list usage

What do you think would be a better alternative? The operations used are iterate, emplace_back, size, and erase.

arsenm added a comment.Oct 2 2019, 4:02 PM

LGTM but I would rather avoid the list usage

What do you think would be a better alternative? The operations used are iterate, emplace_back, size, and erase.

ok

This revision was automatically updated to reflect the committed changes.

Thank you for doing this, it seems quite useful. As a follow-up to this change, do you think it makes sense to refactor CombineInfo a bit? We have a list of mergeable instructions, but the CombineInfo structure also has fields for a second instruction, which are only for temporary use, which is a bit odd.

foad added a subscriber: foad.EditedNov 28 2019, 7:54 AM

Hi @tstellar, I'm looking into a case where this patch slowed down a shader by 10%. Before I go too far, was this patch supposed to change the behaviour at all, or was it supposed to be purely a compile time improvement?

In the case I'm looking at it seems to do the same amount of load merging as before, but the merged loads are inserted at different places in the basic block.

Hi @tstellar, I'm looking into a case where this patch slowed down a shader by 10%. Before I go too far, was this patch supposed to change the behaviour at all, or was it supposed to be purely a compile time improvement?

The intention was to not change the behavior at all.

In the case I'm looking at it seems to do the same amount of load merging as before, but the merged loads are inserted at different places in the basic block.

Do you have a MIR or .ll dump of the shader I could look at? Also, does https://reviews.llvm.org/D65966 help?

foad added a comment.EditedDec 3 2019, 3:36 AM

Hi @tstellar, I'm looking into a case where this patch slowed down a shader by 10%. Before I go too far, was this patch supposed to change the behaviour at all, or was it supposed to be purely a compile time improvement?

The intention was to not change the behavior at all.

In the case I'm looking at it seems to do the same amount of load merging as before, but the merged loads are inserted at different places in the basic block.

Do you have a MIR or .ll dump of the shader I could look at? Also, does https://reviews.llvm.org/D65966 help?

P8173 is a MIR test case. See the RUN line for how to run it. I see significant differences in the placement of the merged BUFFER_LOAD instructions with/without D65961 (or before/after it was committed).

I tried applying D65966 on top of rGe6f51713054f but it made no difference to the output.

Hi @tstellar, I'm looking into a case where this patch slowed down a shader by 10%. Before I go too far, was this patch supposed to change the behaviour at all, or was it supposed to be purely a compile time improvement?

The intention was to not change the behavior at all.

In the case I'm looking at it seems to do the same amount of load merging as before, but the merged loads are inserted at different places in the basic block.

Do you have a MIR or .ll dump of the shader I could look at? Also, does https://reviews.llvm.org/D65966 help?

P8173 is a MIR test case. See the RUN line for how to run it. I see significant differences in the placement of the merged BUFFER_LOAD instructions with/without D65961 (or before/after it was committed).

I tried applying D65966 on top of rGe6f51713054f but it made no difference to the output.

You need to apply it on top of 3a8d80944b7766449e2c8784a8fb30d19a2ba16c or newer for it to have an impact. When I do that, D65966 does help improve the merging.