This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Use flat scratch instructions where available
ClosedPublic

Authored by rampitec on Oct 9 2020, 4:20 PM.

Details

Summary

The support is disabled by default. So far there is instruction
selection, spilling, and frame elimination. It also changes SP
from unswizzled to swizzled as used by flat scratch instructions,
so it cannot be mixed with MUBUF stack access.

At the very least missing:

  • GlobalISel;
  • Some optimizations in frame elimination in between vector and scalar ALU;
  • It shall finally allow to always materialize frame index as an SGPR, but that is not implemented and frame elimination cannot handle it yet;
  • Unaligned and/or multidword flat scratch shall work, but it is legalized now for MUBUF;
  • Operand folding cannot optimize FI like with MUBUF yet;
  • It will need scaling the value of the SP/FP in the DWARF expression to recover the unswizzled scratch address;

Diff Detail

Event Timeline

rampitec created this revision.Oct 9 2020, 4:20 PM
Herald added a project: Restricted Project. · View Herald TranscriptOct 9 2020, 4:20 PM
rampitec requested review of this revision.Oct 9 2020, 4:20 PM

I haven't done a meaningful review, but I wanted to note that this will require changes to the debug information (which isn't committed yet). I think this could be as simple as scaling the value of the SP/FP in the DWARF expression to recover the unswizzled scratch address.

llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
430

s/Scracth/Scratch/

rampitec updated this revision to Diff 297646.Oct 12 2020, 11:22 AM
rampitec marked an inline comment as done.
rampitec edited the summary of this revision. (Show Details)

I haven't done a meaningful review, but I wanted to note that this will require changes to the debug information (which isn't committed yet). I think this could be as simple as scaling the value of the SP/FP in the DWARF expression to recover the unswizzled scratch address.

Thanks! I have fixed the typo and added DWARF update to the commit message.

I haven't done a meaningful review, but I wanted to note that this will require changes to the debug information (which isn't committed yet). I think this could be as simple as scaling the value of the SP/FP in the DWARF expression to recover the unswizzled scratch address.

Is this change changing the call convention ABI? For example, making the SP be a swizzled address as opposed to a swizzled address? If so then AMDGPUUsage will also need updating.

I haven't done a meaningful review, but I wanted to note that this will require changes to the debug information (which isn't committed yet). I think this could be as simple as scaling the value of the SP/FP in the DWARF expression to recover the unswizzled scratch address.

Is this change changing the call convention ABI? For example, making the SP be a swizzled address as opposed to a swizzled address? If so then AMDGPUUsage will also need updating.

Yes it does. However, it is a little premature to update the documentation. This is WIP, disabled by default and more or less not working at least until spilling is implemented. When it is at least working we can consider documenting it. Documenting it earlier just gives an impression there is an option to use it.

rampitec updated this revision to Diff 297963.Oct 13 2020, 2:34 PM

Fixed typo in test check.

rampitec planned changes to this revision.Oct 15 2020, 10:58 AM
rampitec added a reviewer: sebastian-ne.

Testing showed couple problems:

  1. Debug tablegen asserts with this.
  2. Using null register in flat scratch does not work, but it needs a new ST addressing mode of GFX10. I will create a separate patch to support ST mode.
rampitec updated this revision to Diff 298431.Oct 15 2020, 11:52 AM

Fixed operand order in store pattern.

rampitec planned changes to this revision.Oct 15 2020, 11:52 AM

Still needs ST mode.

This will change the ABI, so I don't think belongs as a subtarget property

This will change the ABI, so I don't think belongs as a subtarget property

The ABI will in fact depend on the subtarget. We can only use it starting from gfx9, and then even on gfx9 it might not be desirable. GFX10 is better in this respect.
Anyway, I need a subtarget to decide if we even have flat scratch instructions. So far this switch is experimental, but if you have an idea of a better placement please tell.

rampitec updated this revision to Diff 298509.Oct 15 2020, 4:26 PM

Use ST mode on GFX10 instead of NULL register.

Flakebi added inline comments.
llvm/lib/Target/AMDGPU/FLATInstructions.td
873

Should this be called ScratchLoadSignedPat_D16?

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1445–1450

I get a failing assert here with NewOpc = 4294967295:

llvm/include/llvm/MC/MCInstrInfo.h:63: const llvm::MCInstrDesc &llvm::MCInstrInfo::get(unsigned int) const: Assertion `Opcode < NumOpcodes && "Invalid opcode!"' failed.
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Stack dump:
0.      Program arguments: compiler/llpc/amdllpc -gfxip=10.1 -amdgpu-enable-flat-scratch /pipelines/PipelineVsFs_0x1BEFB7D1A235B4F6.pipe -verify-machineinstrs
1.      Running pass 'CallGraph Pass Manager' on module 'lgcPipeline'.
2.      Running pass 'Prologue/Epilogue Insertion & Frame Finalization' on function '@_amdgpu_ps_main'
 #0 0x00000000023f0db1 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) /llvm/lib/Support/Unix/Signals.inc:563:13
 #1 0x00000000023ef060 llvm::sys::RunSignalHandlers() /llvm/lib/Support/Signals.cpp:72:18
 #2 0x00000000023f1152 SignalHandler(int) /llvm/lib/Support/Unix/Signals.inc:0:3
 #3 0x00007fadd6ebfee0 __restore_rt (/glibc-2.31/lib/libpthread.so.0+0x12ee0)
 #4 0x00007fadd6d0c08a raise (/glibc-2.31/lib/libc.so.6+0x3808a)
 #5 0x00007fadd6cf6528 abort (/glibc-2.31/lib/libc.so.6+0x22528)
 #6 0x00007fadd6cf640f _nl_load_domain.cold.0 (/glibc-2.31/lib/libc.so.6+0x2240f)
 #7 0x00007fadd6d04a02 (/glibc-2.31/lib/libc.so.6+0x30a02)
 #8 0x0000000001a03170 llvm::SIRegisterInfo::eliminateFrameIndex(llvm::MachineInstrBundleIterator<llvm::MachineInstr, false>, int, unsigned int, llvm::RegScavenger*) const /llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp:1465:11
 #9 0x000000000214e0f3 (anonymous namespace)::PEI::replaceFrameIndices(llvm::MachineBasicBlock*, llvm::MachineFunction&, int&) /llvm/lib/CodeGen/PrologEpilogInserter.cpp:0:11
#10 0x000000000214caef llvm::MachineBasicBlock::getNumber() const /llvm/include/llvm/CodeGen/MachineBasicBlock.h:904:34
#11 0x000000000214caef (anonymous namespace)::PEI::replaceFrameIndices(llvm::MachineFunction&) /llvm/lib/CodeGen/PrologEpilogInserter.cpp:1161:17
#12 0x000000000214caef (anonymous namespace)::PEI::runOnMachineFunction(llvm::MachineFunction&) /llvm/lib/CodeGen/PrologEpilogInserter.cpp:269:3
#13 0x0000000002031e7e llvm::MachineFunctionPass::runOnFunction(llvm::Function&) /llvm/lib/CodeGen/MachineFunctionPass.cpp:0:13
#14 0x0000000003136a85 llvm::FPPassManager::runOnFunction(llvm::Function&) /llvm/lib/IR/LegacyPassManager.cpp:1519:27
#15 0x0000000001c76b38 (anonymous namespace)::CGPassManager::RunPassOnSCC(llvm::Pass*, llvm::CallGraphSCC&, llvm::CallGraph&, bool&, bool&) /llvm/lib/Analysis/CallGraphSCCPass.cpp:178:25
#16 0x0000000001c76b38 (anonymous namespace)::CGPassManager::RunAllPassesOnSCC(llvm::CallGraphSCC&, llvm::CallGraph&, bool&) /llvm/lib/Analysis/CallGraphSCCPass.cpp:476:9
#17 0x0000000001c76b38 (anonymous namespace)::CGPassManager::runOnModule(llvm::Module&) /llvm/lib/Analysis/CallGraphSCCPass.cpp:541:18
#18 0x0000000003137149 (anonymous namespace)::MPPassManager::runOnModule(llvm::Module&) /llvm/lib/IR/LegacyPassManager.cpp:0:27
#19 0x0000000003137149 llvm::legacy::PassManagerImpl::run(llvm::Module&) /llvm/lib/IR/LegacyPassManager.cpp:615:44
…
rampitec updated this revision to Diff 298650.Oct 16 2020, 9:40 AM
rampitec marked an inline comment as done.

Renamed pattern.

rampitec added inline comments.Oct 16 2020, 10:10 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1445–1450

I cannot reproduce this. Take in mind that D89424 is not updated to use ST mode yet, so they do not work together yet.

rampitec updated this revision to Diff 299183.Oct 19 2020, 2:56 PM

Rebased to parent.

rampitec updated this revision to Diff 299185.Oct 19 2020, 3:01 PM

Correct rebase patch.

arsenm added inline comments.Oct 19 2020, 3:30 PM
llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
1678

Swap these?

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1478

What happens if this needs an SGPR spill?

rampitec updated this revision to Diff 299210.Oct 19 2020, 3:55 PM
rampitec marked an inline comment as done.
rampitec added inline comments.
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1478

It it can scavenge it it shall be fine as offset shall not change. If not I guess I would need to adjust SP and revert it. I have added FIXME here.

Flakebi added inline comments.Oct 20 2020, 7:59 AM
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
544

Should this be MFI->hasFlatScratchInit() || (ST.enableFlatScratch() && requiresStackPointerReference(MF))?
Otherwise, the scratch does not get initialized (I guess it’s fine to do that in a later patch).

rampitec added inline comments.Oct 20 2020, 10:51 AM
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
544

Do you see it not initialized? In the SIMachineFunctionInfo() there is this code:

if (ST.hasFlatAddressSpace() && isEntryFunction() && isAmdHsaOrMesa) {
  // TODO: This could be refined a lot. The attribute is a poor way of
  // detecting calls or stack objects that may require it before argument
  // lowering.
  if (HasCalls || HasStackObjects)
    FlatScratchInit = true;
}

So I assume it has to be initialized. Probably the culprit is this isAmdHsaOrMesa condition? It may be needed to say (isAmdHsaOrMesa || ST.enableFlatScratch()) instead. For some reason this code is not executed for amdpal, I do not see an obvious reason why.

rampitec added inline comments.Oct 20 2020, 11:22 AM
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
544

Actually I see it uninitialized in my own test. But the code as suggested does not always work because requiresStackPointerReference() is not necessarily true if we have just private loads, we need to make sure hasFlatScratchInit() is set.

rampitec updated this revision to Diff 299439.Oct 20 2020, 12:25 PM
rampitec marked an inline comment as done.
  • Ensure flat scratch initialization;
  • Added asserts around scavenger calls until there is a better handling of failed scavenging;
rampitec updated this revision to Diff 299818.Oct 21 2020, 3:31 PM
rampitec retitled this revision from [AMDGPU] Select flat scratch instructions where available to [AMDGPU] Use flat scratch instructions where available.
rampitec edited the summary of this revision. (Show Details)
  • Integrated spilling from child revision, child is dropped;
  • Fixed situation when an SGPR has to be spilled while scavenging in frame elimination;

I also came to conclusion that the only robust way to have no failed scavenging during frame lowering is to always have an sp or fp. Otherwise it can fail regardless of the spilling method. The only other way is to have an instruction with full 32 bit immediate offset. I.e. it can fail in a kernel with MUBUF as well.

I also came to conclusion that the only robust way to have no failed scavenging during frame lowering is to always have an sp or fp. Otherwise it can fail regardless of the spilling method. The only other way is to have an instruction with full 32 bit immediate offset. I.e. it can fail in a kernel with MUBUF as well.

I was considering requiring an FP if the stack size was starting to hit the offset limit, but was unable to come up with a testcase where it would really break

I also came to conclusion that the only robust way to have no failed scavenging during frame lowering is to always have an sp or fp. Otherwise it can fail regardless of the spilling method. The only other way is to have an instruction with full 32 bit immediate offset. I.e. it can fail in a kernel with MUBUF as well.

I was considering requiring an FP if the stack size was starting to hit the offset limit, but was unable to come up with a testcase where it would really break

This sounds like a good idea. We can run into a situation when we can scavenge nothing at all, even if it is not easy to forge a testcase. That is more so with flat scratch until ST mode is available as you always need a register as a base. In fact in this scenario it may be needed even if potential offsets are small. Then we do not need buffer descriptor with flat scratch, so we are saving 4 SGPRs. It sounds fair to use one for the base pointer instead.

rampitec updated this revision to Diff 300086.Oct 22 2020, 12:58 PM

Fixed a need of SGPR spill during VGPR spilling on targets w/o flat scratch ST mode, reused existing code adjusting offsets.

rampitec updated this revision to Diff 300113.Oct 22 2020, 3:25 PM

Fixed issue with flat scratch not always being initialized. It was not initialized if we had no stack objects or calls, but later did spilling.
It is too late to insert system SGPRs at frame lowering, so initialize it always if flat scratch is used.

arsenm added inline comments.Oct 23 2020, 8:35 AM
llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
1559–1560

This should really be a pattern predicate. I was recently working on fixing these explicit subtarget checks in the complex patterns recently but didn't finish

arsenm added inline comments.Oct 23 2020, 8:39 AM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
4308–4311 ↗(On Diff #300113)

Unrelated change?

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
779–783

This looks backwards with the negated conditions

rampitec updated this revision to Diff 300315.Oct 23 2020, 8:51 AM
rampitec marked 2 inline comments as done.

Corrected IsOffsetLegal to remove negation.

rampitec updated this revision to Diff 300322.Oct 23 2020, 9:17 AM
rampitec marked an inline comment as done.

Moved predicates from complex patterns into td files.

rampitec added inline comments.Oct 23 2020, 10:56 AM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
4308–4311 ↗(On Diff #300113)

It is related, we just never hit it before. I am probing a physical SGPR to see if it is legal. RC is SReg_32, but DRC for scratch instructions is SReg_32_XEXEC_HI and test fails.

rampitec marked an inline comment as done.Oct 23 2020, 11:37 AM
rampitec updated this revision to Diff 300368.Oct 23 2020, 12:04 PM

Removed unrelated subtarget change.

Flakebi accepted this revision.Oct 26 2020, 8:30 AM

Looks good to me.
I tested it with the amdvlk vulkan driver (needs a pal-specific patch) and a short Vulkan CTS test ran through fine (except for pal-related failures).

This revision is now accepted and ready to land.Oct 26 2020, 8:30 AM
rampitec requested review of this revision.Oct 26 2020, 2:31 PM
This revision was not accepted when it landed; it landed in state Needs Review.Oct 26 2020, 2:41 PM
This revision was automatically updated to reflect the committed changes.

It looks like this broke the windows lldb bot:

http://lab.llvm.org:8011/#/builders/83/builds/336

Oops! Release build in fact. Will fix soon.