This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Use aperture registers instead of S_GETREG
ClosedPublic

Authored by Pierre-vh on Nov 7 2022, 5:03 AM.

Details

Summary

Fixes a longstanding TODO in the codebase where we were using S_GETREG + shift to do something that could simply be done with an inline constant (register).

Patch based on D31874 by @kzhuravl
Depends on D137767

Diff Detail

Event Timeline

Pierre-vh created this revision.Nov 7 2022, 5:03 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 7 2022, 5:03 AM
Pierre-vh requested review of this revision.Nov 7 2022, 5:03 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 7 2022, 5:03 AM

Note: I'm unsure if codegen is actually correct for this. I don't understand this bit particularly well but instead of leaving this one the side I thought I would give it a shot and discuss whether it's correct during review.

foad added a comment.Nov 7 2022, 5:38 AM
This comment was removed by foad.
Pierre-vh updated this revision to Diff 473645.Nov 7 2022, 6:04 AM

Restore ID_MEM_BASE, it is needed to support parsing/rewriting getreg

arsenm requested changes to this revision.Nov 7 2022, 7:15 AM

I'm assuming you haven't tried running this. I have another attempt for this I did about 2 years ago I can send you. My conclusion then was this is a broken feature, and it's not really a 32-bit register. Instead, this seems to be a 64-bit register and any 32-bit operand ends up reading the low bits (which are always 0). We should still try to use them, but you have to extract the high bits with a shift

This revision now requires changes to proceed.Nov 7 2022, 7:15 AM
arsenm added a comment.Nov 7 2022, 7:25 AM

I'm assuming you haven't tried running this. I have another attempt for this I did about 2 years ago I can send you.

I can't find the piece that was anymore than what you have. I know I had a patch emitting a 64-bit shift somewhere

I'm assuming you haven't tried running this. I have another attempt for this I did about 2 years ago I can send you. My conclusion then was this is a broken feature, and it's not really a 32-bit register. Instead, this seems to be a 64-bit register and any 32-bit operand ends up reading the low bits (which are always 0). We should still try to use them, but you have to extract the high bits with a shift

I indeed haven't tried running it yet, I planned too but some issues prevented me from doing it.
Do you mean that we need to add a 32 bits right-shift to src_shared_based before using it?
Should we just store it in a SGPR pair and use the high register maybe? (e.g. s_mov_b64 s[0:1], src_shared_base then use s0 (or is it s1? I always forget)

Pierre-vh updated this revision to Diff 474243.EditedNov 9 2022, 5:54 AM

So I indeed checked ocltst and the previous version crashed.
Now it looks fine, but the verifier crashes in a lot of tests, especially GISel ones.
The issue is that we need to use this register as a 64 bit operand but it's a 32 bit register, so the verifier complains on S_MOV_B64.

How can I solve this? Is there another instruction I could use or do we have to change the RC of the src_shared/private_base register to 64 bit? (ideally it should be available for both, no?)

Note: tests haven't all been updated yet because of the verifier crashes.

foad added a comment.Nov 9 2022, 6:02 AM

do we have to change the RC of the src_shared/private_base register to 64 bit?

Yes that sounds right.

(ideally it should be available for both, no?)

That would be useful bu apparently it's not how the hardware works. I guess it gives you the 32 low bits which is not very useful because it will always be 0 for src_*_base and -1 for src_*_limit.

do we have to change the RC of the src_shared/private_base register to 64 bit?

Yes that sounds right.

(ideally it should be available for both, no?)

That would be useful bu apparently it's not how the hardware works. I guess it gives you the 32 low bits which is not very useful because it will always be 0 for src_*_base and -1 for src_*_limit.

Which RegisterClass should I use for it? Just SGPR64?

foad added a comment.Nov 9 2022, 6:09 AM

do we have to change the RC of the src_shared/private_base register to 64 bit?

Yes that sounds right.

(ideally it should be available for both, no?)

That would be useful bu apparently it's not how the hardware works. I guess it gives you the 32 low bits which is not very useful because it will always be 0 for src_*_base and -1 for src_*_limit.

Which RegisterClass should I use for it? Just SGPR64?

I don't know. I'm not too familiar with that stuff. Can you copy what we do for SGPR_NULL64?

Pierre-vh planned changes to this revision.Nov 9 2022, 7:26 AM

Need to be rebased on a future patch that will add the 64 bit variant of the aperture registers (almost done, just need to get tests to pass)

Pierre-vh edited the summary of this revision. (Show Details)Nov 10 2022, 12:18 AM
Pierre-vh added a reviewer: foad.

How have you tested this? OpenCL conformance flat tests with -O0 and -O2 should be good enough

llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
1829

You can just build a copy at this point. We should use copies and mark it as a constant register

Pierre-vh marked an inline comment as done.Nov 15 2022, 2:59 AM

How have you tested this? OpenCL conformance flat tests with -O0 and -O2 should be good enough

I used https://github.com/RadeonOpenCompute/ROCm-OpenCL-Runtime/tree/develop/tests/ocltst, do you mean I should use https://github.com/KhronosGroup/OpenCL-CTS?
Which test should I run? All of them? I don't see a "flat" category

llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
1829

If I use a COPY it'll think it's okay to eliminate it and use the _HI register directly (which is not supposed to exist but I had to declare it to fix tablegen issues)
How can I prevent that? I'm not fully satisfied with how things are done in D137767 - I'd really like to eliminate that "HI" register but I'm not sure how.

Pierre-vh marked an inline comment as done.

Use COPY (restricting to SGPR works, I got confused and was restricting to SREG)

arsenm added inline comments.Nov 16 2022, 8:56 AM
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
1829

It's too early to set a register class here. In principle it should work, but it would be better to let the selection constrain the register class normally

How have you tested this? OpenCL conformance flat tests with -O0 and -O2 should be good enough

I used https://github.com/RadeonOpenCompute/ROCm-OpenCL-Runtime/tree/develop/tests/ocltst, do you mean I should use https://github.com/KhronosGroup/OpenCL-CTS?
Which test should I run? All of them? I don't see a "flat" category

https://github.com/KhronosGroup/OpenCL-CTS/tree/main/test_conformance/generic_address_space

Pierre-vh updated this revision to Diff 477113.Nov 22 2022, 1:39 AM

Rebase, constrain in select instead

Note: the machine where my full dev setup is is down at the moment so I can't run OCLTst right now, will do it as soon as possible (and before landing definitely)

arsenm added inline comments.Nov 22 2022, 6:24 AM
llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
119 ↗(On Diff #477113)

Lowercase is

149–150 ↗(On Diff #477113)

You shouldn't need to special case this (also note there's no equivalent for the DAG). Either it needs to belong to a different class from SReg_64, or you need to reserve the high bits

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
5521–5524

This will get combined later to bitcast and extract later, but I guess that's more annoying to emit

llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-copy.mir
485–488 ↗(On Diff #477113)

A basic selection test shouldn't have pre-set register classes

arsenm requested changes to this revision.Nov 22 2022, 6:25 AM
This revision now requires changes to proceed.Nov 22 2022, 6:25 AM
Pierre-vh added inline comments.Nov 22 2022, 6:34 AM
llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
149–150 ↗(On Diff #477113)

or you need to reserve the high bits

Aren't they reserved with reserveRegisterTuples?

If I create a different class, do you mean I should exclude those aperture registers from SReg and create a superclass with both SReg and the apertures? I'm not sure I understand how the class will help if I shouldn't special-case the uses of those aperture registers

arsenm added inline comments.Nov 22 2022, 7:33 AM
llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
149–150 ↗(On Diff #477113)

If the reserve didn't do it, then yes. Using different classes is the correct way to represent use restrictions. Really what we want is a class without the sub1 subregister, but has a hack we're trying to avoid that, so a class that excludes the aperture regs is the next best thing.

Not sure if you really need the reserve after that.

Pierre-vh added inline comments.Nov 23 2022, 4:06 AM
llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
149–150 ↗(On Diff #477113)

Do I need to do something similar to SRegOrLds, then?

Also:

  • Do I need to do this for 32 bit apertures or 64 bit apertures only?
  • If I remove Aperture registers from SReg and create a superclass, won't I need to update instruction definitions to use that superclass instead of SReg? Won't that require a lot of changes?

And I suppose all those changes should be in a separate patch, right?

arsenm added inline comments.Nov 23 2022, 10:23 AM
llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
149–150 ↗(On Diff #477113)

Do I need to do this for 32 bit apertures or 64 bit apertures only?

Well the 32-bit register cases are purely artificial. I'd assume these are included only by the broadest set of SGPRs (ideally wouldn't be in any allocatable set)

If I remove Aperture registers from SReg and create a superclass, won't I need to update instruction definitions to use that superclass instead of SReg? Won't that require a lot of changes?

This is a question of naming. I'd probably lean towards leaving the SReg name/operands as-is, and introducing an allocatable subclass that excludes them (e.g. like SReg_64_XEXEC)

Pierre-vh updated this revision to Diff 478210.Nov 28 2022, 5:58 AM
Pierre-vh marked an inline comment as done.

Addressing comments following discussion

arsenm accepted this revision.Nov 28 2022, 6:44 AM

LGTM assuming this actually works

This revision is now accepted and ready to land.Nov 28 2022, 6:44 AM
This revision was landed with ongoing or failed builds.Nov 30 2022, 4:25 AM
This revision was automatically updated to reflect the committed changes.