This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Add S_MOV_B64_IMM_PSEUDO for wide constants
ClosedPublic

Authored by rampitec on Jun 24 2021, 1:54 PM.

Details

Summary

This is to allow 64 bit constant rematerialization. If a constant
is split into two separate moves initializing sub0 and sub1 like
now RA cannot rematerizalize a 64 bit register.

This gives 10-20% uplift in a set of huge apps heavily using double
precession math.

Fixes: SWDEV-292645

Diff Detail

Event Timeline

rampitec created this revision.Jun 24 2021, 1:54 PM
rampitec requested review of this revision.Jun 24 2021, 1:54 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 24 2021, 1:54 PM
Herald added a subscriber: wdng. · View Herald Transcript

Do you have a MIR example that fails to rematerialize as is? We split so many 64-bit operations it might be worthwhile just teaching the coalescer how to handle it

Do you have a MIR example that fails to rematerialize as is? We split so many 64-bit operations it might be worthwhile just teaching the coalescer how to handle it

We fail to rematerialize this newly added test remat-fp64-constants.ll, it shall be fairly easy to stop it before RA.
I agree if we can teach RA to materialize subregs it would be the best, but honestly I have failed.

Do you have a MIR example that fails to rematerialize as is? We split so many 64-bit operations it might be worthwhile just teaching the coalescer how to handle it

We fail to rematerialize this newly added test remat-fp64-constants.ll, it shall be fairly easy to stop it before RA.
I agree if we can teach RA to materialize subregs it would be the best, but honestly I have failed.

I have sent you the mir. If you have an idea how to dial with LiveRangeEdit::canRematerializeAt() with subregs, that would be much nicer.

foad added inline comments.Jun 25 2021, 3:57 AM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
2698

Does this ever return true? If I understand the isel changes, if these conditions were satisfied, you would have selected S_MOV_B64 instead of S_MOV_B64_IMM_PSEUDO.

rampitec added inline comments.Jun 25 2021, 8:15 AM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
2698

One can eventually end up here after a folding.

rampitec updated this revision to Diff 354562.Jun 25 2021, 11:31 AM

Fixed some tidy warnings.

rampitec updated this revision to Diff 354579.Jun 25 2021, 12:27 PM
rampitec marked an inline comment as done.

Dropped dead code in the isFoldableCopy().

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
2698

Actually you are right, it is dead given checks when it is produced.

I do not actually think we can fix or at least easily fix rematerialization itself when we have subregs, something like this:

undef %25.sub1:sreg_64 = S_MOV_B32 1056702873
%25.sub0:sreg_64 = S_MOV_B32 2093606576

Moving a first instruction will leave a second one with %25 use without a def. We would need to always move all defs and that hardly counts as trivially rematerializable, especially if we have multiple defs.

Although that seems to be possible to improve this and combine individual 32 bit moves into this pseudo inside the RA pipeline, after coalescer and right before the pre-RA scheduler. That way greedy will see a single instruction, but we also allow coalescer to work. This should remove the heuristic aspect of it, although will need a new pass and SReg_64 defs scan.

@qcolombet do you mind to tell your opinion, if that still may be rematerializable in the RA?

undef %25.sub1:sreg_64 = S_MOV_B32 1056702873
%25.sub0:sreg_64 = S_MOV_B32 2093606576

I would really appreciate a second thought.

rampitec updated this revision to Diff 355353.Jun 29 2021, 1:49 PM
rampitec retitled this revision from [AMDGPU] Add S_MOV_B64_IMM_PSEUDO for fp constants to [AMDGPU] Add S_MOV_B64_IMM_PSEUDO for wide constants.
rampitec edited the summary of this revision. (Show Details)

Removed selection part, we will still select split wide constants.
Added pass to combine such assignments into the new pseudo after the coalescer and rename independent subregs.
This gives better results at a cost of compile time.

Probably should rename the pass to be a bit more generic. There are other optimizations we could potentially do here (e.g. this is really where we probably should be optimizing m0 initializations instead of what we do now, and maybe some AGPR-VGPR fixups)

llvm/lib/Target/AMDGPU/GCNCombineTupleInits.cpp
87 ↗(On Diff #355353)

uint64_t

107 ↗(On Diff #355353)

Need to make sure this shift is done as uint64_t to avoid UB

115 ↗(On Diff #355353)

Typo Comining

133 ↗(On Diff #355353)

Don't need the newline

154–155 ↗(On Diff #355353)

Initialize and define at the same time?

156 ↗(On Diff #355353)

Why not also handle the VALU cases?

arsenm added inline comments.Jun 29 2021, 3:54 PM
llvm/lib/Target/AMDGPU/GCNCombineTupleInits.cpp
94 ↗(On Diff #355353)

What happens if there are additional defs? I also don't see a dedicated MIR test for this?

dfukalov added inline comments.
llvm/test/CodeGen/AMDGPU/inline-constraints.ll
61–62

Can be removed?

rampitec updated this revision to Diff 355404.Jun 29 2021, 4:55 PM
rampitec marked 7 inline comments as done.

Addressed review comments.

llvm/lib/Target/AMDGPU/GCNCombineTupleInits.cpp
94 ↗(On Diff #355353)

If a Def0 or Def1 not null and we need to assign it function exits. Added mir test.

156 ↗(On Diff #355353)

I have tried by I see no benefits. We can add it later if we see any, I just didn't. Doing this on VALU will also need an EXEC modification scan between defs. In addition we do not always conver a v_mov_b32 into brev, so I have also seen some regressions. That would need to be addressed first too.

PSDB and ePSDB passed.

arsenm accepted this revision.Jun 30 2021, 11:01 AM

LGTM with missing case added

llvm/test/CodeGen/AMDGPU/combine-sreg64-inits.mir
77

Also should check the case with the implicit operand on the moves

This revision is now accepted and ready to land.Jun 30 2021, 11:01 AM
rampitec updated this revision to Diff 355644.Jun 30 2021, 11:35 AM
rampitec marked an inline comment as done.

Added tests with imp-use and imp-def.

This revision was landed with ongoing or failed builds.Jun 30 2021, 11:46 AM
This revision was automatically updated to reflect the committed changes.