This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Add IR-level pass to rewrite away address space 7
Needs ReviewPublic

Authored by krzysz00 on Aug 21 2023, 2:59 PM.

Details

Summary

This commit adds the -lower-buffer-fat-pointers pass, which is
applicable to all AMDGCN compilations.

The purpose of this pass is to remove the type ptr addrspace(7) from
incoming IR. This must be done at the LLVM IR level because `ptr
addrspace(7)`, as a 160-bit primitive type, cannot be correctly
handled by SelectionDAG.

The detailed operation of the pass is described in comments, but, in
summary, the removal proceeds by:

  1. Rewriting loads and stores of ptr addrspace(7) to loads and stores

of i160 (including vectors and aggregates). This is needed because the
in-register representation of these pointers will stop matching their
in-memory representation in step 2, and so ptrtoint/inttoptr
operations are used to preserve the expected memory layout

  1. Mutating the IR to replace all occurrences of ptr addrspace(7)

with the type {ptr addrspace(8), ptr addrspace(6) }, which makes the
two parts of a buffer fat pointer (the 128-bit address space 8
resource and the 32-bit address space 6 offset) visible in the IR.
This also impacts the argument and return types of functions.

  1. *Splitting* the resource and offset parts. All instructions that

produce or consume buffer fat pointers (like GEP or load) are
rewritten to produce or consume the resource and offset parts
separately. For example, GEP updates the offset part of the result and
a load uses the resource and offset parts to populate the relevant
llvm.amdgcn.raw.ptr.buffer.load intrinsic call.

At the end of this process, the original mutated instructions are
replaced by their new split counterparts, ensuring no invalidly-typed
IR escapes this pass. (For operations like call, where the struct form
is needed, insertelement operations are inserted).

Compared to LGC's PatchBufferOp (
https://github.com/GPUOpen-Drivers/llpc/blob/32cda89776980202597d5bf4ed4447a1bae64047/lgc/patch/PatchBufferOp.cpp
): this pass

  • Also handles vectors of ptr addrspace(7)s
  • Also handles function boundaries
  • Includes the same uniform buffer optimization for loops and

conditionals

  • Does *not* handle memcpy() and friends (this is future work)
  • Does *not* break up large loads and stores into smaller parts. This

should be handled by extending the legalization
of *.buffer.{load,store} to handle larger types by producing multiple
instructions (the same way ordinary LOAD and STORE are legalized).
That work is planned for a followup commit.

  • Does *not* have special logic for handling divergent buffer

descriptors. The logic in LGC is, as far as I can tell, incorrect in
general, and, per discussions with @nhaehnle, isn't widely used.
Therefore, divergent descriptors are handled with waterfall loops
later in legalization.

As a final matter, this commit updates atomic expansion to treat
buffer operations analogously to global ones.

(One question for reviewers: is the new pass is the right place?
Should it be later in the pipeline?)

Diff Detail

Event Timeline

krzysz00 created this revision.Aug 21 2023, 2:59 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 21 2023, 2:59 PM
krzysz00 requested review of this revision.Aug 21 2023, 2:59 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 21 2023, 2:59 PM

This patch adds 4 new uses of UndefValue::get. Could you please check if some of these can be replaced with poison?
We are tying to get rid of undef.
Thanks!

krzysz00 updated this revision to Diff 552362.Aug 22 2023, 7:49 AM

Swap one of the undefs for poison

@nlopes

  • Two of the UndefValue uses are in splitting up an existing undef
  • One is respecting "When an Instruction is deleted, its debug uses change to undef. " from the documentation on how to update debug info for pass authors
  • The last has been changed to poison

@nlopes

  • Two of the UndefValue uses are in splitting up an existing undef
  • One is respecting "When an Instruction is deleted, its debug uses change to undef. " from the documentation on how to update debug info for pass authors
  • The last has been changed to poison

Thank you!

The debug information stuff hasn't been migrated to poison yet, yes.

arsenm added inline comments.Aug 23 2023, 4:36 PM
llvm/lib/Target/AMDGPU/AMDGPULowerBufferFatPointers.cpp
259

Use the queried integer type from the DataLayout?

547–548

don't you just need reverse iteration and/or make_early_inc_range?

krzysz00 updated this revision to Diff 553180.Aug 24 2023, 10:09 AM

Use data layout widths instead of hardcoded pointer widths

Also, swap from a worklist to an early_inc_range for the memory operation rewrites.

krzysz00 marked 2 inline comments as done.Aug 24 2023, 2:03 PM
krzysz00 added inline comments.Aug 25 2023, 7:22 AM
llvm/lib/Target/AMDGPU/AMDGPULowerBufferFatPointers.cpp
946

Document why this traversal order is enough

1225

@nhaehnle Upstream atomic_load from graphics?

1281

We might have instructions for this but no intrinsics?

1761

Give isBufferFatPtrConst a cache but make it fully recursive

1825

What about making a new function and moving the basic blocks? (Example needed)

piotr added a subscriber: piotr.Aug 29 2023, 6:03 AM

Thanks for working on this. Just a few high-level comments:

Does *not* handle memcpy() and friends (this is future work)

Heads-up - there is an open PR discussing their removal from the LLPC pass (https://github.com/GPUOpen-Drivers/llpc/pull/2518).

(One question for reviewers: is the new pass is the right place? Should it be later in the pipeline?)

When I was considering this a while ago, I thought the pass could be placed as late as possible to take as much advantage of generic load/store handling as possible. I would say at least after CodeGenPrepare (-codegenprepare) to take advantage of addressing mode matching (see also loosely related D137066). Is there a particular reason for the placement as in the patch?

llvm/lib/Target/AMDGPU/AMDGPULowerBufferFatPointers.cpp
1225

This is D138786 (but appears to have stalled).

krzysz00 updated this revision to Diff 556309.Sep 8 2023, 3:14 PM

Move after codegenprepare, rebase, fix lack of an or constant

krzysz00 updated this revision to Diff 556591.Sep 12 2023, 10:34 AM

Rebase and update tests

piotr added a comment.EditedSep 13 2023, 3:28 AM

I have done the first round of testing of the patch on graphics content, with LLPC PatchBufferOp disabled and small modification in LLPC to use ptr addrspace (7) in lieu of lgc.buffer.desc.to.ptr.

Found 5 different issues, although 4 of them are in the isel and look related to the mentioned TODO ("break up large loads and stores into smaller parts"). The 5th one is in the pass itself (llvm/lib/IR/Value.cpp:507: void llvm::Value::doRAUW(llvm::Value *, llvm::Value::ReplaceMetadataUses): Assertion `!contains(New, this) && "this->replaceAllUsesWith(expr(this)) is NOT valid!"' failed.).
I will send you all reproducers offline.

I also noticed that s_buffer_load instruction is not generated anymore. In PatchBufferOp we have a logic that understands llvm.invariant.start intrinsic and generates s_buffer_load intrinsic accordingly. This is how the front-end marks a load from constant memory (side note: s_buffer_load can still end up being buffer_load during isel if offset is divergent, see SITargetLowering::lowerSBuffer). Do you have any plans to handle that use case in the new lowering? This is an important case for Vulkan applications using uniform blocks.

arsenm added inline comments.Sep 13 2023, 4:45 AM
llvm/lib/Target/AMDGPU/AMDGPULowerBufferFatPointers.cpp
2

Not padded out to same column as the later header lines

14

I wouldn't bother mentioning this and just not add the pass for r600

44

addrspace(6) is not just 32-bit, it's 32-bit constant. We shouldn't be assuming these are only const accesses. Should the offset just not be a pointer at all?

54
55

"handled by containing values"? Not sure what this means

64

where we 'loud'?

68
70–71

atomic load and store should work fine

80
108
166
241–242

What's the point in having these be separate?

Also not sure why these are virtual

260

This is just getIntPtrType

264

This is just getIntPtrType

284

Type *&

Could you also move all the core logic to a separate function, such that every return doesn't need to consider the assignment to the map?

299

dyn_cast? Also could make a predicate function and clarify what uniqued means

349

remove dead return

361

Don't really think CONSTANT_ADDRESS_32BIT is appropriate for this

413
588

can use {}?

1832

Could also create an anonymous function and use takeName. Are you getting an added suffix now?

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
14627–14629

Probably should have an additional predicate function for this. It's separate but not really from global

llvm/test/CodeGen/AMDGPU/lower-buffer-fat-pointers-p7-in-memory.ll
35

Is this alignment correct? I thought p7 was 256 bit / align 32

162

Also load of array and load of unpacked struct?

arsenm added inline comments.Sep 13 2023, 4:50 AM
llvm/lib/Target/AMDGPU/AMDGPULowerBufferFatPointers.cpp
1825

Do you need a copyAttributesFrom?

arsenm added inline comments.Sep 13 2023, 4:51 AM
llvm/test/CodeGen/AMDGPU/lower-buffer-fat-pointers-calls.ll
80

Test some external declarations, and some cases with non-default linkage and attributes that should be preserved

krzysz00 updated this revision to Diff 557078.Sep 19 2023, 2:38 PM
krzysz00 marked 11 inline comments as done.

Move to i32 for offsets, fix function handling

krzysz00 marked an inline comment as done.Sep 19 2023, 2:39 PM
krzysz00 added inline comments.
llvm/lib/Target/AMDGPU/AMDGPULowerBufferFatPointers.cpp
241–242

I had them separate because, in the struct case, remapVector(T) isn't <NxremapScalar(T)>.

And they're virtual because we have a base class method that'll call these methods and we want it to call the right ones depending on which subclass we're working with.

284

Type *& starts throwing compile errors.

Also, I'm basing this off existing code, and I suspect the whole set-entry-on-return setup has to do with updating everything correctly when there's recursion (self-referential structs and the like).

llvm/test/CodeGen/AMDGPU/lower-buffer-fat-pointers-calls.ll
80

I think I got those?

krzysz00 updated this revision to Diff 557425.Sep 27 2023, 2:24 PM
krzysz00 marked an inline comment as done.

Rebase

krzysz00 updated this revision to Diff 557917.Oct 27 2023, 9:06 AM

Update ptrmask lowering and tests because of PR #69343

arsenm added inline comments.Oct 30 2023, 3:47 AM
llvm/lib/Target/AMDGPU/AMDGPULowerBufferFatPointers.cpp
25
38
299

not done?

679

Braces

685

Braces

689

Braces

1010
1019
1022
1024
1136

Don't think you're supposed to use a raw delete on any llvm value? Is there an erase method of some kind?

1176

What is this for? The safer thing to do would be record not volatile in metadata

krzysz00 updated this revision to Diff 557939.Oct 30 2023, 10:22 AM
krzysz00 marked 13 inline comments as done.

Address review comments

llvm/lib/Target/AMDGPU/AMDGPULowerBufferFatPointers.cpp
299

Moved to a dyn_cast and added a comment - I don't think we need to create a whole function for this boolean

1176

On the one hand, that'd make sense. On the other hand, are buffer loads and stores currently volatile (and so we'd have metadata marking them as non-volatile) or would volatile by default be a breaking semantics change?

The point of this was to not lose information.

arsenm added inline comments.Oct 30 2023, 6:24 PM
llvm/lib/Target/AMDGPU/AMDGPULowerBufferFatPointers.cpp
1176

Assuming volatile by default is always conservatively correct. Since you technically can drop metadata, relying on it to preserve knowledge of volatileness is risky

krzysz00 added inline comments.Nov 15 2023, 9:55 AM
llvm/lib/Target/AMDGPU/AMDGPULowerBufferFatPointers.cpp
1176

Could we, perhaps, use the high bits of the flags constant for volatile / non-volatile (ex. set bit 31 if this buffer op was IR-level volatile)?

Or would it make sense - since the *.ptr.buffer.* intrinsics aren't widely used yet - to go back and add alignment and volatility arguments to them?

Or is there some other approach I'm missing?

piotr added inline comments.Mon, Nov 27, 2:31 AM
llvm/lib/Target/AMDGPU/AMDGPULowerBufferFatPointers.cpp
1176

I think all three options listed would work here (bits reinterpretation, separate args, inverted metadata). I would be slightly in favor of reinterpreting the flags bits for this, because it should be faster than inverted metadata.