This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU][GlobalISel] Fix waterfall loops
ClosedPublic

Authored by sebastian-ne on Sep 1 2021, 3:33 AM.

Details

Summary
  • Move the s_and exec to its correct position before the content of the waterfall loop
  • Use the SI_WATERFALL pseudo instruction, like for sdag, to benefit from optimizations
  • Add support for indirect function calls

To support indirect calls, add a G_SI_CALL instruction without register
class restrictions and insert a waterfall loop when applying register
banks.

Adjust the handling of new basic blocks in RegBankSelect to cope with
the new basic blocks inserted for indirect calls.

Diff Detail

Event Timeline

sebastian-ne created this revision.Sep 1 2021, 3:33 AM
sebastian-ne requested review of this revision.Sep 1 2021, 3:33 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 1 2021, 3:33 AM
foad added a comment.Sep 1 2021, 5:22 AM

No objections from me.

arsenm added inline comments.Sep 1 2021, 6:21 AM
llvm/lib/CodeGen/GlobalISel/RegBankSelect.cpp
754

2 c's

llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp
906–908

I think in the absence of knowing if the call target is uniform in CallLowering, we can't do tail calls

llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
3142–3146

I'm not sure you can guarantee this is the range that needs to be moved. Could other instructions have been moved across these between the IRTranslator and here?

sebastian-ne marked 2 inline comments as done.

Disallow indirect tail calls

llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
3142–3146

The running passes seem to be

  • IRTranslator
  • AMDGPUPreLegalizerCombiner
  • Localizer
  • Legalizer
  • AMDGPUPostLegalizerCombiner
  • RegBankSelect

It’s fine if VALU instructions are moved inside the loop, but SALU instructions would be a problem.

Can we prevent instructions from being moved into the call code?

arsenm added inline comments.Sep 7 2021, 5:06 PM
llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
3142–3146

I was thinking of doing this before by using virtual registers for the call pseudo and expanding to physical registers later, but it seems like overkill. Practically speaking we don't really have code reordering right now that would present an issue, and don't see why we would ever add it. The localizer does some level of reordering, but it's really just a simple sink to uses. In practice we'll only see copies inside this range. Overall I can't see a real issue doing it this way, but it still feels wrong since there's no formal guarantee

sebastian-ne added inline comments.Sep 28 2021, 3:13 AM
llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
3142–3146

I tried out putting the call code inside adjcallstackup to -down into a bundle when it is generated, but I hit some problems.

  1. The MachineVerifier complains about undefined physical registers when they are defined and used inside the bundle. It only considers def-ed registers live after a bundle, not inside it, which is more the style of vliw-bundles.
  2. Bundles in SSA form (before register allocation) apparently do not have a BUNDLE instruction and no defs/uses on the bundle. Also, a bundle always needs an instruction before it, so having a call as the first instruction of a block does not work, because there is no instruction before the bundle, which is rather unfortunate.
foad added inline comments.Sep 28 2021, 3:18 AM
llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
3142–3146

Also, a bundle always needs an instruction before it

Really? Why?

sebastian-ne added inline comments.Sep 28 2021, 4:13 AM
llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
3142–3146

Hm, maybe it doesn’t and just the syntax looks weird. I thought the ADJCALLSTACKUP would be outside the bundle, but probably it is part of it. (I tried adding it, which failed with an assert that I interpreted as “there must be an instruction before”, at a second look it sounds more like “it’s already part of a bundle”).

ADJCALLSTACKUP 0, 0, implicit-def $scc {
  %16:_(p4) = COPY %2:sgpr_64
  ; …
  $vgpr31 = COPY %33:_(s32)
  $sgpr30_sgpr31 = SI_CALL %15:sreg_64(p0), 0, <regmask $sgpr32 $sgpr33 $sgpr34 $sgpr35 $sgpr36 $sgpr37 $sgpr38 $sgpr39 $sgpr40 $sgpr41 $sgpr42 $sgpr43 $sgpr44 $sgpr45 $sgpr46 $sgpr47 $sgpr48 $sgpr49 $sgpr50 $sgpr51 $sgpr52 $sgpr53 $sgpr54 $sgpr55 $sgpr56 $sgpr57 $sgpr58 $sgpr59 $sgpr60 $sgpr61 $sgpr62 $sgpr63 $sgpr64 and 1092 more...>, implicit $sgpr0_sgpr1_sgpr2_sgpr3, implicit $sgpr4_sgpr5, implicit $sgpr6_sgpr7, implicit $sgpr8_sgpr9, implicit $sgpr10_sgpr11, implicit $sgpr12, implicit $sgpr13, implicit $sgpr14, implicit $vgpr31
  ADJCALLSTACKDOWN 0, 0, implicit-def $scc
}
foad added inline comments.Sep 28 2021, 4:42 AM
llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
3142–3146

Right, a normal bundle looks like

BUNDLE {
  INSN1
  INSN2
}

but as you noticed the bundle's "header" instruction doesn't need to be a BUNDLE instruction, so you can have this instead:

INSN1 {
  INSN2
  INSN3
}

Use bundles for call instructions, so no other instructions can be moved in-between.

foad added inline comments.Oct 6 2021, 6:29 AM
llvm/lib/CodeGen/GlobalISel/Localizer.cpp
172 ↗(On Diff #377518)

Why is this change needed? Normally a BUNDLE instruction has use operands representing all the uses of the instructions inside the bundle, so I think this code should just work without any changes? See comment in lowerCall...

llvm/lib/CodeGen/MachineVerifier.cpp
2273 ↗(On Diff #377518)

Why is this change needed?

Anyway I think you can write !is_contained(regsDefined, Reg).

llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp
1391

Use finalizeBundle instead of MIBundleBuilder, to add all the required operands to the BUNDLE instruction?

Use is_contained.

llvm/lib/CodeGen/GlobalISel/Localizer.cpp
172 ↗(On Diff #377518)

As far as I understand, it only has that after register allocation (or somewhere in these passes, like phi elimination. The pass that adds the BUNDLE instruction also adds the uses and defs).

llvm/lib/CodeGen/MachineVerifier.cpp
2273 ↗(On Diff #377518)

The MachineVerifier sees the whole bundle as a single instruction. regsLive will only be updated after the whole bundle is worked on. This makes sense for VLIW architectures like Hexagon but doesn’t make sense for grouping instructions in amdgpu.
So, making this target specific is probably the ideal way, maybe I should do this? This currently relaxes the check to allow uses to refer to defines that happen in the same bundle.

Thanks, is_contained looks useful.

llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp
1391

As far as I understood, finalizeBundle should only be used after register allocation. This will be done in some pass there, insert the BUNDLE instruction and add the uses and defs. Adding the uses and defs here will fail the MachineVerifier, because you can only have a single def in SSA mode.

The bundle will be dissolved before it gets to the finalizeBundle point though.

foad added inline comments.Oct 6 2021, 8:05 AM
llvm/lib/CodeGen/GlobalISel/Localizer.cpp
172 ↗(On Diff #377518)

AMDGPU creates some bundles at instruction selection time, and it does use finalizeBundle, but they do not have any defs, so maybe that does not violate SSA? Example:

BUNDLE implicit killed %6:vgpr_32, implicit $m0, implicit $exec {
  DS_GWS_BARRIER killed %6:vgpr_32, 0, implicit $m0, implicit $exec :: (load (s32) from custom "GWSResource")
  S_WAITCNT 0
}
foad added a comment.Oct 6 2021, 8:07 AM

I have just found this thread about bundling before register allocation: https://lists.llvm.org/pipermail/llvm-dev/2017-February/110467.html

Thanks, the Worklist.insert does read better than std::copy.

I guess we need to get some agreement about the high level design: is this a good way of handling newly created instructions and BBs? In particular I wonder if there is some way of handling newly created instructions automatically, without having to call setNextInstruction, but I can't quite see how it would work.

Yes, it gets quite complicated when introducing loops for call instructions because it moves multiple instructions and now also unbundles them (so the I++ used in RegBankSelect points to the wrong instruction – after the bundle).

That comment went to the wrong review :)

sebastian-ne added inline comments.Oct 6 2021, 8:22 AM
llvm/lib/CodeGen/GlobalISel/Localizer.cpp
172 ↗(On Diff #377518)

It definitely complains if I try to finalize the bundle :)

*** Bad machine code: Multiple virtual register defs in SSA form ***
- function:    test_indirect_call_sgpr_ptr
- basic block: %bb.1  (0x6d87e8)
- instruction: BUNDLE implicit-def $scc, implicit-def %16:_(p4), implicit-def %17:_(p4), implicit-def %19:_(p4), implicit-def %20:_(s64), implicit-def %18:_(p4), implicit-def %21:_(s64), implicit-def %22:_(s32), implicit-def %23:_(s32), implicit-def %24:_(s32), implicit-def %25:_(s32), implicit-def %26:_(s32), implicit-def %27:_(s32), implicit-def %28:_(s32), implicit-def %29:_(s32), implicit-def %30:_(s32), implicit-def %31:_(s32), implicit-def %32:_(s32), implicit-def %33:_(s32), implicit-def %34:_(<4 x s32>), implicit-def $sgpr0_sgpr1_sgpr2_sgpr3, implicit-def $sgpr0, implicit-def $sgpr0_lo16, implicit-def $sgpr0_hi16, implicit-def $sgpr1, implicit-def $sgpr1_lo16, implicit-def $sgpr1_hi16, implicit-def $sgpr2, implicit-def $sgpr2_lo16, implicit-def $sgpr2_hi16, implicit-def $sgpr3, implicit-def $sgpr3_lo16, implicit-def $sgpr3_hi16, implicit-def $sgpr0_sgpr1, implicit-def $sgpr0_sgpr1_sgpr2, implicit-def $sgpr2_sgpr3, implicit-def $sgpr4_sgpr5, implicit-def $sgpr4, implicit-def $sgpr4_lo16, implicit-def $sgpr4_hi16, implicit-def $sgpr5, implicit-def $sgpr5_lo16, implicit-def $sgpr5_hi16, implicit-def $sgpr6_sgpr7, implicit-def $sgpr6, implicit-def $sgpr6_lo16, implicit-def $sgpr6_hi16, implicit-def $sgpr7, implicit-def $sgpr7_lo16, implicit-def $sgpr7_hi16, implicit-def $sgpr8_sgpr9, implicit-def $sgpr8, implicit-def $sgpr8_lo16, implicit-def $sgpr8_hi16, implicit-def $sgpr9, implicit-def $sgpr9_lo16, implicit-def $sgpr9_hi16, implicit-def $sgpr10_sgpr11, …

Do not use bundles anymore because they create more problems than they solve (e.g. the legalizer tries and fails to legalize instructions in a bundle).
Instead, select copies from and to physical registers that are used/defined by the call instruction and move them into the waterfall loop.

arsenm added inline comments.Oct 26 2021, 4:00 PM
llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp
1038

Add a todo to check divergence info?

llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
3137–3138

It's OK to just hardcode the opcodes here

3169

This makes me nervous about overlapping register tuples. Exact match on physical registers is rarely a good idea

sebastian-ne marked 3 inline comments as done.

Fix review comments: Use readsRegister and modifiesRegister instead of exact register matches.

arsenm accepted this revision.Oct 27 2021, 12:46 PM

Having to shuffle instructions around feels wrong but I don't have a better idea

This revision is now accepted and ready to land.Oct 27 2021, 12:46 PM
This revision was automatically updated to reflect the committed changes.