This is an archive of the discontinued LLVM Phabricator instance.

[NVPTX] Auto-generate tests for sufrace and texture instructions
ClosedPublic

Authored by asavonic on Nov 22 2021, 6:46 AM.

Details

Summary

The patch adds LIT tests for SULD, SUST, TEX and TLD4 instructions as a follow up for D112232.
There are a number of FIXME marks that highlight possible bugs or missed instruction variants.

Diff Detail

Event Timeline

asavonic created this revision.Nov 22 2021, 6:46 AM
asavonic requested review of this revision.Nov 22 2021, 6:46 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 22 2021, 6:46 AM
tra added a comment.Nov 22 2021, 12:30 PM

Awsome! I'll go over the script in a couple of days.
Meanwhile, could you post a representative example of the code this file generates?

BTW, how many instruction variants does it generate and how long does it take to run the test?

Awsome! I'll go over the script in a couple of days.
Meanwhile, could you post a representative example of the code this file generates?

BTW, how many instruction variants does it generate and how long does it take to run the test?

Currently the test takes ~2 seconds on my machine.
There are 508 unique variants, but some are duplicated for cuda and nvcl triples, so there are total 847 test cases.
I think we can improve this, I'll post an update tomorrow.

target triple = "nvptx-unknown-cuda"

declare i64 @llvm.nvvm.texsurf.handle.internal.p1i64(i64 addrspace(1)*)
@gsurf = internal addrspace(1) global i64 0, align 8
@gtex = internal addrspace(1) global i64 1, align 8

;; access: i32 %x
;; instruction: suld.b.1d.b8.trap
;; retty: i16
;; reg_access: {%r{{[0-9]}}}
;; test_name: test_suld_1db8trap
;; global_surf: gsurf
;; intrinsic: llvm.nvvm.suld.1d.i8.trap
;; reg_surf: %rd{{[0-9]+}}
;; reg_ret: {%rs{{[0-9]+}}}

declare i16 @llvm.nvvm.suld.1d.i8.trap(i64 %s, i32 %x);

; CHECK-LABEL: .func test_suld_1db8trap_index
; CHECK: suld.b.1d.b8.trap {%rs{{[0-9]+}}}, [%rd{{[0-9]+}}, {%r{{[0-9]}}}]
;
define void @test_suld_1db8trap_index(i64 %s, i16* %ret, i32 %x) {
  %val = tail call i16 @llvm.nvvm.suld.1d.i8.trap(i64 %s, i32 %x)
  store i16 %val, i16* %ret
  ret void
}
; CHECK-LABEL: .func test_suld_1db8trap_global
; CHECK: suld.b.1d.b8.trap {%rs{{[0-9]+}}}, [gsurf, {%r{{[0-9]}}}]
;
define void @test_suld_1db8trap_global(i64 %s, i16* %ret, i32 %x) {
  %gs = tail call i64 @llvm.nvvm.texsurf.handle.internal.p1i64(i64 addrspace(1)* @gsurf)
  %val = tail call i16 @llvm.nvvm.suld.1d.i8.trap(i64 %gs, i32 %x)
  store i16 %val, i16* %ret
  ret void
}
tra added a comment.Nov 22 2021, 1:46 PM

Currently the test takes ~2 seconds on my machine.
There are 508 unique variants, but some are duplicated for cuda and nvcl triples, so there are total 847 test cases.

OK, this is short enough not to worry about splitting into separate tests.

I'm curious -- if you feed the output of llc on generated tests, will it be able to assemble it?
I've found a number of bugs that way while implementing *MMA instructions and tests.

asavonic added a comment.EditedNov 24 2021, 11:49 AM

There were several issues found by ptxas, mostly related to metadata which is missing in this patch.
Other issues are weird: it seems that code generated for non-unified mode does not pass ptxas unless I mark the function as a kernel.
I'll post more details and update the patch as soon as I figure out what is going on.

asavonic updated this revision to Diff 390069.Nov 26 2021, 8:14 AM
  • Added nvvm metadata for kernel, surface, texture and sampler globals and parameters.
  • Verified the generated code with ptxas.
  • Changed the generated code so that it conforms to both NVCL and CUDA environments.
tra added a comment.Nov 29 2021, 11:58 AM

Looks good overall. My main concern is with the test reaching out to LLVM sources and parsing tablegen files.

llvm/test/CodeGen/NVPTX/surf-tex.py
17

%S only points to the source directory for the test itself. It does not guarantee that anything outside of it is available.
E.g. a test could be run remotely, with only the test directory and the tested binaries being shipped.
While it may work within developer's sandbox, it will break for some users and buildbots.

We should probably disable this run or figure out how to incorporate the needed info w/o reaching out to the LLVM's source tree.
E.g. extract the list of instructions in tablegen files now and commit it as the input data for the test.
Parsing tablegen files themselves will be fragile anyways. If/when we eventually clean that copy/paste mess, we'd need to expand all the tablegen macros in order to figure out the list of instructions they will produce. We may as well scrip that now and just use the results for the tests.

112

nvptx64 would be a better choice as that's what we actually use.

191–193

Interesting. PTX spec is a bit vague about what's OK and not OK to do with function parameters. We do know that ld.param is the only way to read them. Taking address forces creation of a local copy. I wonder what surface lookup with direct use of parameter does under the hood -- treats it as a taken address and forces a local copy, too?

208–211

Can you elaborate? an example on godbolt.org would be helpful.

486–487

b32 does not prescribe any specific type, so f32 registers are fine to use, too. Having a mix of registers in what's notionally a vector does seem odd, but it does make sense for NVIDIA GPUs. The actual registers are type-agnostic and can be used for both FP and integer ops.

607

We didn't touch texture/surface intrinsics since the original code drop by NVIDIA. We are likely to be missing a bunch of newer ones, added by NVIDIA since then.
Whether it's worth bothering to add them is the question. Unless we have specific use case for them in LLVM (e.g. if we can use them to optimize something), it may be easier to just implement the instructions in clang's headers as inline asm.

asavonic updated this revision to Diff 392098.Dec 6 2021, 9:31 AM
  • Removed FIXMEs about mixed vector types.
  • Disabled --verify RUN line.
  • Changed nvptx triple to nvptx64
llvm/test/CodeGen/NVPTX/surf-tex.py
17

Agree. I added this to make sure that the script covers all intrinsics/instructions we currently have. Lets disable this RUN line for now.

112

Done.

191–193

PTX spec s5.3 "Texture Sampler and Surface Types" summarizes how surface/texture variables can be used, but does not include a lot of details on how they actually work.

Variable definition within global (module) scope and in kernel entry parameter lists.

It does not say this explicitly, but I *guess* that .surfref as a kernel parameter works the same as a global .surfref and therefore can be used directly.

Creating pointers to opaque variables using mov, e.g., mov.u64 reg, opaque_var;. The resulting pointer may be stored to and loaded from memory, passed as a parameter to functions, and de-referenced by texture and surface load, store, and query instructions, but the pointer cannot otherwise be treated as an address, i.e., accessing the pointer with ld and st instructions, or performing pointer arithmetic will result in undefined results.

However, when we take a "pointer" to a surface/texture as a register, it now works like any other type and needs an ld.param if it is passed to a function.

208–211

If has_surface_param is set, we emit the metadata that turns the first parameter into a .surfref instead of .u64. However, NVPTX specifically keeps ld.param instructions when compiling for CUDA target (see NVPTXReplaceImageHandles.cpp:1809), and ptxas does not accept this.

.visible .entry test_suld_1db8trap_param(
	.param .surfref test_suld_1db8trap_param_param_0,
	.param .u64 test_suld_1db8trap_param_param_1,
	.param .u32 test_suld_1db8trap_param_param_2
)
{
[...]
  // ptxas: error   : Illegal operand type to instruction 'ld'
  ld.param.u64 	%rd1, [test_suld_1db8trap_param_param_0];
[...]
  suld.b.1d.b8.trap {%rs1}, [%rd1, {%r1}];
[...]
}

By not emitting the metadata, we lower the argument as a .u64 and it can now be used with ld.param. I think this is considered to be an indirect access to a surface (PTX spec s5.3 Texture Sampler and Surface Types).

.visible .entry test_suld_1db8trap_param(
	.param .u64 test_suld_1db8trap_param_param_0,
	.param .u64 test_suld_1db8trap_param_param_1,
	.param .u32 test_suld_1db8trap_param_param_2
)
{
[...]
  ld.param.u64 	%rd1, [test_suld_1db8trap_param_param_0];
[...]
  suld.b.1d.b8.trap {%rs1}, [%rd1, {%r1}];
[...]
}

For NVCL environment the ld.param instruction is not generated and the .surfref parameter is used directly:

.entry test_suld_1db8trap_param(
	.param .surfref test_suld_1db8trap_param_param_0,
	.param .u64 .ptr .align 2 test_suld_1db8trap_param_param_1,
	.param .u32 test_suld_1db8trap_param_param_2
)
{
[...]
  suld.b.1d.b8.trap {%rs1}, [test_suld_1db8trap_param_param_0, {%r1}];
[...]
}

To be honest, I don't know if there is any reason for this difference between CUDA and NVCL handling. Maybe the check at NVPTXReplaceImageHandles.cpp:1809 is overly restrictive and can be improved to allow both variants (.surfref and .u64).

486–487

Thank you for clarifying this. Removed the FIXME.

607

I see. We can keep these FIXMEs here as a documentation on what is not supported.

tra accepted this revision.Dec 6 2021, 10:29 AM
tra added inline comments.
llvm/test/CodeGen/NVPTX/surf-tex.py
191–193

but the pointer cannot otherwise be treated as an address

Ugh. This part of PTX spec is such a pain. Adding special texture/surface reference magic to it does not help at all. :-(

We may need to add a special case for the references to the nvptx-lower-args pass. We currently allow some byval parameters to be accessed directly, but it does not know anything about textures/surfaces.

This revision is now accepted and ready to land.Dec 6 2021, 10:29 AM