This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Reserve v32 if we may need to copy between AGPRs on gfx908
ClosedPublic

Authored by arsenm on Jan 27 2022, 4:10 PM.

Details

Reviewers
rampitec
cdevadas
Group Reviewers
Restricted Project
Summary

We need to guarantee cheap copies between AGPRs, and unfortunately
gfx908 cannot directly do this. Theoretically we could set the
scavenger up with an emergency spill slot, but it also feels
unreasonable to pay that cost for what was assumed to be a simple and
cheap copy. Pick a register that doesn't conflict with any ABI
registers.

This does not address the same issue when copying from SGPR to AGPR
for gfx90a (this coincidentally fixes it for gfx908), but that's less
interesting since the register allocator shouldn't be proactively
introducing such copies.

One edge case I'm worried about is respecting the VGPR budget implied
by amdgpu-waves-per-eu. If the theoretical upper bound of a function
is 32 VGPRs, this will force the actual count to be 33.

Diff Detail

Event Timeline

arsenm created this revision.Jan 27 2022, 4:10 PM
arsenm requested review of this revision.Jan 27 2022, 4:10 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 27 2022, 4:10 PM
Herald added a subscriber: wdng. · View Herald Transcript

I think it has to be dynamic depending on the requested occupancy. But even then it can drop the occupancy of a kernel if it uses less than 32 registers, which not uncommon. I do not believe we can reserve it that high.

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
601

It misses error reporting and setRegUsed call.

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
639

You can skip it if !MFI->usesAGPRs(MF).

I think it has to be dynamic depending on the requested occupancy. But even then it can drop the occupancy of a kernel if it uses less than 32 registers, which not uncommon. I do not believe we can reserve it that high.

Reserving in the function argument range is a problem. We also treat the requested occupancy as a hint, not something we're forced to follow

I think it has to be dynamic depending on the requested occupancy. But even then it can drop the occupancy of a kernel if it uses less than 32 registers, which not uncommon. I do not believe we can reserve it that high.

Reserving in the function argument range is a problem. We also treat the requested occupancy as a hint, not something we're forced to follow

Documentation (https://clang.llvm.org/docs/AttributeReference.html#amdgpu-waves-per-eu) is a bit contradictory:

An error will be given if:

  • Specified values violate subtarget specifications;
  • Specified values are not compatible with values provided through other attributes;
  • The AMDGPU target backend is unable to create machine code that can meet the request.

So it is an error. But then:

This attribute may be attached to a kernel function definition and is an optimization hint.

Anyway, inability to run kernels at maximum occupancy is a show stopper itself.

I think it has to be dynamic depending on the requested occupancy. But even then it can drop the occupancy of a kernel if it uses less than 32 registers, which not uncommon. I do not believe we can reserve it that high.

Reserving in the function argument range is a problem. We also treat the requested occupancy as a hint, not something we're forced to follow

Documentation (https://clang.llvm.org/docs/AttributeReference.html#amdgpu-waves-per-eu) is a bit contradictory:

An error will be given if:

  • Specified values violate subtarget specifications;
  • Specified values are not compatible with values provided through other attributes;
  • The AMDGPU target backend is unable to create machine code that can meet the request.

This was never implemented as an error and I remember the intent was to make this fuzzy so your code would not break if there was a change in subtarget behavior

This attribute may be attached to a kernel function definition and is an optimization hint.

Anyway, inability to run kernels at maximum occupancy is a show stopper itself.

This is practically impossible if you are using mfma instructions anyway

Anyway, inability to run kernels at maximum occupancy is a show stopper itself.

This is practically impossible if you are using mfma instructions anyway

This might be OK, but consider two things: 1) you are not checking agprs are unused (easy to fix) 2) there are some mfma instructions which only need a128. I don't believe there are kernels which fit that budget, but I cannot blidnly deny it too.

I.e. I would prefer to fail compilation rather than reserving v32.

Anyway, inability to run kernels at maximum occupancy is a show stopper itself.

This is practically impossible if you are using mfma instructions anyway

This might be OK, but consider two things: 1) you are not checking agprs are unused (easy to fix) 2) there are some mfma instructions which only need a128. I don't believe there are kernels which fit that budget, but I cannot blidnly deny it too.

I.e. I would prefer to fail compilation rather than reserving v32.

I guess disallowing direct agpr -> agpr copy for gfx908 early from instruction selection would be a better fix. I am anyway working on a patch to avoid direct sgpr -> agpr copies. In that way, we could avoid Scavenger altogether while lowering COPY instructions.

I guess disallowing direct agpr -> agpr copy for gfx908 early from instruction selection would be a better fix. I am anyway working on a patch to avoid direct sgpr -> agpr copies. In that way, we could avoid Scavenger altogether while lowering COPY instructions.

Note though copies are created in many passes.

I guess disallowing direct agpr -> agpr copy for gfx908 early from instruction selection would be a better fix. I am anyway working on a patch to avoid direct sgpr -> agpr copies. In that way, we could avoid Scavenger altogether while lowering COPY instructions.

Note though copies are created in many passes.

Ah, the copies during register coalescer can't be avoided. Never mind.

arsenm updated this revision to Diff 405415.Feb 2 2022, 1:33 PM

Skip if there aren't any AGPRs, add test where inline asm is broken

rampitec added inline comments.Feb 2 2022, 2:21 PM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
631

Still "&& !ST.hasGFX90AInsts()".
Use logical and.

arsenm updated this revision to Diff 405451.Feb 2 2022, 2:27 PM

Fix typo

On the second thought, scavengeRegister will spill if nothing is available. Isn't that sufficient?

On the second thought, scavengeRegister will spill if nothing is available. Isn't that sufficient?

There's no emergency scavenging slot set up for the scavenger here. It's also insane and quadratic to set up the scavenger for every copy we need to process. The assumption is also that copies between 2 registers should always be fast, which isn't the case if there could be a spill here

On the second thought, scavengeRegister will spill if nothing is available. Isn't that sufficient?

There's no emergency scavenging slot set up for the scavenger here. It's also insane and quadratic to set up the scavenger for every copy we need to process. The assumption is also that copies between 2 registers should always be fast, which isn't the case if there could be a spill here

It's not every copy, it is an emergency situation though? Reserving a register in the middle of the register file is really a last resort thing.

On the second thought, scavengeRegister will spill if nothing is available. Isn't that sufficient?

There's no emergency scavenging slot set up for the scavenger here. It's also insane and quadratic to set up the scavenger for every copy we need to process. The assumption is also that copies between 2 registers should always be fast, which isn't the case if there could be a spill here

It's not every copy, it is an emergency situation though? Reserving a register in the middle of the register file is really a last resort thing.

The fact that it can happen from just a copy is an issue. Combined with the issue in D115401, I don't see a perfectly reliable system without reserving a register. If we wanted to scavenge here, we would need to have the pass maintain the liveness instead of repeating it every iteration which adds a lot more expense too.

rampitec added inline comments.Feb 3 2022, 12:02 PM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
601

Ping.

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
631

Still need !ST.hasGFX90AInsts().

arsenm added inline comments.Feb 4 2022, 12:03 PM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
631

hasGFX90AInsts is the if preceding this else if so that doesn't reach here

arsenm updated this revision to Diff 406064.Feb 4 2022, 12:12 PM

Fix setRegUsed

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
601

The error reporting was pointless because if !AllowSpill, scavengeRegister emits the error

rampitec accepted this revision.Feb 4 2022, 12:15 PM
This revision is now accepted and ready to land.Feb 4 2022, 12:15 PM