This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Rematerialize scalar loads
Needs ReviewPublic

Authored by piotr on Jun 29 2023, 6:59 AM.

Details

Summary

Extend the list of instructions that can be rematerialized in
SIInstrInfo::isReallyTriviallyReMaterializable() to support scalar loads.

Try shrinking instructions to remat only the part needed for current
context. Add SIInstrInfo::reMaterialize target hook, and handle shrinking
of S_LOAD_DWORDX16_IMM to S_LOAD_DWORDX8_IMM as a proof of concept.

Diff Detail

Event Timeline

piotr created this revision.Jun 29 2023, 6:59 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 29 2023, 6:59 AM
piotr requested review of this revision.Jun 29 2023, 6:59 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 29 2023, 6:59 AM
piotr added inline comments.Jun 29 2023, 7:23 AM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
109

This is a pre-existent issue, but I am not very comfortable with the fact that we override isReallyTriviallyReMaterializable the way we do.

It works because we rely on extra checks being present before the method is invoked (for example on calling LiveRangeEdit::allUsesAvailableAt()). However, the comment in the base class says isReallyTriviallyReMaterializable "must return false if (..) or if it requres any address registers that are not always available."

Having said that, I am not sure if there is a practical solution for that (apart from softening the comment in the base class).

114

Ironically, the addition of the check isDereferenceableInvariantLoad makes the examples that drive this work not handled, as the S_LOADs are not marked as dereferenceable at the moment (but that is another issue).

llvm/test/CodeGen/AMDGPU/remat-smrd.mir
4

The patch handles the positive cases involving loads, and does not affect the negative cases.

4

(Need to add a test for shrinking)

arsenm added inline comments.Jun 29 2023, 9:01 AM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
109

Yes, I don't like this either. The way the rematerializable hooks is structured is confusing and part of why this never got done before

114

Don't really need the isSMRD with the isDereferenceableInvariantLoad. We should also be able to handle any load

2381

Don't name TRI, should use RI member here

2387–2388

Fix weird line breaks, comment before switch

2393

I'd hope we would have access to a live lane mask that we need to handle. Failing that, do you really need the copy opcode exception?

2396

don't use auto here, or at least use a const *

2400

s/TRI/RI/

2407

Don't think this can happen

2419

Can you go through getSubRegisterClass (possibly with getMatchingSuperRegClass and getSubClassWithSubReg) to avoid hardcoding this

2427

keep this all as int64_t

piotr added inline comments.Jun 29 2023, 10:00 AM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
114

That was my intuition as well, but I added this extra check for these reasons:

Maybe checking for MMO->isInvariant() is enough? And that would mean the negative test would turn to a positive one.

arsenm added inline comments.Jun 29 2023, 11:13 AM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
114

Would rematerialization ever happen in a context where the use didn't have the same dominating conditions as at the def? My intuition is that it couldn't, and thus the dereferenceable check (including the generic one) would be too strong

2419

Easier yet would be just use the result class from the instruction desc

Also, is this safe from other users with a different class?

piotr added inline comments.Jun 30 2023, 9:15 AM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
2419

Thanks - will rewrite this, but what exactly do you mean here by "safe from other users"?

arsenm added inline comments.Jun 30 2023, 12:26 PM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
2419

I mean other instructions using the same virtual register that aren't expecting the class to change. You're mutating the existing register and not creating a new one with the new class

piotr updated this revision to Diff 538530.Jul 10 2023, 12:44 AM
  • Addressed review comments.
  • Relaxed check to include all invariant loads, not only dereferenceable ones.
  • Rebased patch over the commit with new/changed tests.
  • Updated MMO with new size in the shrinking path.
piotr marked 9 inline comments as done.Jul 10 2023, 12:51 AM
piotr added inline comments.
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
2419

I'm mutating DestReg which does not have users. Adding the assertion, but maybe an early out would be more future-proof.

arsenm added inline comments.Jul 10 2023, 6:01 PM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
114

As a pre-commit can you post a patch to weaken the generic check? Best to be on the same page as all targets for this

Could really use a MIR test that shows this. Also would be nice to have some evil cases, where the result register is tied to the input pointer register

piotr added a comment.Jul 12 2023, 7:52 AM

Could really use a MIR test that shows this. Also would be nice to have some evil cases, where the result register is tied to the input pointer register

This patch is now based on a test update (https://reviews.llvm.org/D154816), where I am also adding a new test that exercises the shrinking - test_remat_s_load_dword_immx16_subreg.

Can you describe the evil case(s) in more detail? Do you mean S_LOAD_DWORDX16_IMM with tied-def, or something else?

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
114

Ok, will take a look.

piotr updated this revision to Diff 542230.EditedJul 19 2023, 4:04 PM

Ran extensive testing on graphics workloads, which uncovered some bugs. Added fixes and more tests for those interesting cases in D154816.

Could really use a MIR test that shows this. Also would be nice to have some evil cases, where the result register is tied to the input pointer register

This patch is now based on a test update (https://reviews.llvm.org/D154816), where I am also adding a new test that exercises the shrinking - test_remat_s_load_dword_immx16_subreg.

Can you describe the evil case(s) in more detail? Do you mean S_LOAD_DWORDX16_IMM with tied-def, or something else?

Yes. I don't think subregisters with tied operands are particularly well defined, but I was thinking something like %0:sreg_256 = S_LOAD_DWORDX16 %0.sub0_sub1

Could really use a MIR test that shows this. Also would be nice to have some evil cases, where the result register is tied to the input pointer register

This patch is now based on a test update (https://reviews.llvm.org/D154816), where I am also adding a new test that exercises the shrinking - test_remat_s_load_dword_immx16_subreg.

Can you describe the evil case(s) in more detail? Do you mean S_LOAD_DWORDX16_IMM with tied-def, or something else?

Yes. I don't think subregisters with tied operands are particularly well defined, but I was thinking something like %0:sreg_256 = S_LOAD_DWORDX16 %0.sub0_sub1

Added test case in D154816.

piotr added inline comments.Jul 21 2023, 12:59 AM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
114

I looked closer at that. Checked that if I modify TargetInstrInfo::isReallyTriviallyReMaterializableGeneric with a relaxed check (copy of MI.isDereferenceableInvariantLoad with && MMO->isDereferenceable() removed) - then that makes no difference in the existing tests.
X86 relies on further checks in that function "A load from a constant PseudoSourceValue is invariant".
So while that boosted my confidence that the generic code does not rely on "dereferenceable", I am not sure if any changes to the generic check are practical. That would be pretty ugly, as I'd have to pretty much copy isDereferenceableInvariantLoad only to have isDereferenceable() check removed.

piotr added inline comments.Jul 21 2023, 1:01 AM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
114

(And AMDGPU would not benefit from any of this as we rely on our own logic anyway.)

arsenm added inline comments.Jul 26 2023, 10:15 AM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
114

I don't follow how it's difficult. It's straightforward relaxing of a restricition? I don't understand how the x86 case complicates matters (also, x86 shouldn't need to special case that instance either)

piotr added inline comments.Jul 26 2023, 11:59 PM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
114

It is not difficult, but I am unsure if we really want to extend MachineInstr with isInvariantLoad() that would be a copy of isDereferenceableInvariantLoad() but with the relaxed condition. I can add a private common function to avoid duplicating the code, but I do not see how we can avoid adding a new function. We cannot just edit isDereferenceableInvariantLoad because it has more uses than just the one inside isReallyTriviallyReMaterializableGeneric(), and some of them really need isDereferenceable check.

As an alternative, instead of modifying MachineInstr I could just add a new private function in TargetInstrInfo, just for the use in isReallyTriviallyReMaterializableGeneric().

piotr updated this revision to Diff 553492.Aug 25 2023, 8:27 AM

Rebased.

piotr added a comment.Aug 25 2023, 8:35 AM

Ping - I think the only unresolved point is potential weakening of the generic check.

piotr added a comment.Sep 21 2023, 5:30 AM

Tested on 10k pipelines from Vulkan games, this patch reduces the number of v_writelane instructions from 9842 to 6440 (at the expense of using more loads of course).

Any objections to pushing this?