This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] New AMDGPUInsertDelayAlu pass
ClosedPublic

Authored by foad on Jun 21 2022, 6:00 AM.

Details

Reviewers
Joe_Nash
rampitec
piotr
jpages
Group Reviewers
Restricted Project
Commits
rGcfb7ffdec0eb: [AMDGPU] New AMDGPUInsertDelayAlu pass

Diff Detail

Event Timeline

foad created this revision.Jun 21 2022, 6:00 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 21 2022, 6:00 AM
foad requested review of this revision.Jun 21 2022, 6:00 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 21 2022, 6:00 AM
foad added a reviewer: Restricted Project.Jun 21 2022, 6:01 AM
foad added a comment.Jun 21 2022, 6:03 AM

For tests with generated checks: if the only difference was adding a bunch of s_delay_alu instructions then I regenerated the checks. If it was more disruptive (e.g. if it meant that shared prefixes could no longer be used) then I added -amdgpu-enable-delay-alu=0 to the GFX11 RUN lines instead.

arsenm added inline comments.Jun 22 2022, 5:54 PM
llvm/lib/Target/AMDGPU/AMDGPUInsertDelayAlu.cpp
162

Should check LLVM_ENABLE_DUMP instead

217

Bad autos, I'm not even sure what the type is supposed to be here

301–302

Can merge these into one LLVM_DEBUG. Also should use printMBBReference

313–320

This is D128313 (except for SI_RETURN_TO_EPILOG which I'm not sure about since it's a bit weird)

323–328

Should extract this into a separate predicate function

330–332

Ditto

345–348

Why does the opcode need special casing here? Why not every tied operand?

385–386

One LLVM_DEBUG

foad updated this revision to Diff 439299.Jun 23 2022, 2:51 AM

Address review comments.

foad marked 7 inline comments as done.Jun 23 2022, 2:59 AM
foad added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUInsertDelayAlu.cpp
301–302

Done, but the single LLVM_DEBUG looks uglier to me.

313–320

Can update this when that patch lands.

345–348

Because the hardware does not see a RAW dependency here:

v_mov_b32 v0, 0
v_writelane_b32 v0, s0, 0

At the thread level there is no dependency. The MIR instruction uses a tied read-write operand for v0, which I suppose is a wave-level representation where the read represents the lanes that are not modified.

Other instructions with tied operands (like V_MAC) represent a real read of a VGPR in all active lanes, so we do want to model a delay for them.

foad updated this revision to Diff 441061.Jun 29 2022, 9:52 AM
foad marked 2 inline comments as done.

Rebase on D128313.

Joe_Nash accepted this revision.Jun 29 2022, 10:24 AM
This revision is now accepted and ready to land.Jun 29 2022, 10:24 AM
jpages accepted this revision.Jun 29 2022, 10:43 AM
This revision was landed with ongoing or failed builds.Jun 29 2022, 1:33 PM
This revision was automatically updated to reflect the committed changes.