
[AMX] Prototype for vector and amx bitcast.
Needs Revision · Public

Authored by LuoYuanke on Mar 23 2021, 1:16 AM.

Details

Reviewers
lebedev.ri
Summary

Introduce a new intrinsic to cast between vector and amx types. This
prevents middle-end optimization on the bitcast. However, sometimes we
need that optimization for bitcast: for inner_product in amx_cast.c, we
have to handle llvm.x86.vector.amx.cast.v256i32.x86amx ourselves.

Diff Detail

Unit Tests: Failed

Time    Test
520 ms  x64 debian > Clang.CodeGen/X86::amx_api.c
Script: -- : 'RUN: at line 1'; /mnt/disks/ssd0/agent/llvm-project/build/bin/clang -cc1 -internal-isystem /mnt/disks/ssd0/agent/llvm-project/build/lib/clang/13.0.0/include -nostdsysteminc /mnt/disks/ssd0/agent/llvm-project/clang/test/CodeGen/X86/amx_api.c -flax-vector-conversions=none -ffreestanding -triple=x86_64-unknown-unknown -target-feature +avx512f -target-feature +amx-int8 -target-feature +amx-bf16 -emit-llvm -o - -Werror -pedantic | /mnt/disks/ssd0/agent/llvm-project/build/bin/FileCheck /mnt/disks/ssd0/agent/llvm-project/clang/test/CodeGen/X86/amx_api.c --check-prefixes=CHECK
830 ms  x64 windows > Clang.CodeGen/X86::amx_api.c
Script: -- : 'RUN: at line 1'; c:\ws\w16n2-1\llvm-project\premerge-checks\build\bin\clang.exe -cc1 -internal-isystem c:\ws\w16n2-1\llvm-project\premerge-checks\build\lib\clang\13.0.0\include -nostdsysteminc C:\ws\w16n2-1\llvm-project\premerge-checks\clang\test\CodeGen\X86\amx_api.c -flax-vector-conversions=none -ffreestanding -triple=x86_64-unknown-unknown -target-feature +avx512f -target-feature +amx-int8 -target-feature +amx-bf16 -emit-llvm -o - -Werror -pedantic | c:\ws\w16n2-1\llvm-project\premerge-checks\build\bin\filecheck.exe C:\ws\w16n2-1\llvm-project\premerge-checks\clang\test\CodeGen\X86\amx_api.c --check-prefixes=CHECK

Event Timeline

LuoYuanke created this revision.Mar 23 2021, 1:16 AM
LuoYuanke requested review of this revision.Mar 23 2021, 1:16 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptMar 23 2021, 1:16 AM

I'm a little bit lost with all this AMX stuff.
Could you please explain, in normal human words, what __tile_loadd() does?
I.e. given

void wrapper(__tile& dst, const void *base, int stride) {
  _tile_loadd(dst, base, stride);
}

which bytes from base will be loaded?

clang/test/CodeGen/X86/amx_cast.c
1

Please don't use llvm optimizations in clang tests.

4

The tests should be hermetic.
Which immintrin.h is being used there?
Hopefully the one from clang, not the one from system?

@lebedev.ri, this patch is mainly for discussing the approach that Florian proposed, so I didn't polish my code. Nevertheless, your comments on amx_cast.c are right. __tile_loadd() loads a 2D tile from memory; there is an extra stride parameter. As I explained on llvm-dev, it loads each row from memory into the tile register and then advances base += stride. So the data is not contiguous in memory.
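The row-by-row, strided semantics described above can be sketched in plain C. This is a minimal model, not the real AMX intrinsic: `model_tile_loadd` and its parameter names are illustrative assumptions.

```c
#include <stdint.h>
#include <string.h>

/* Minimal model of the strided tile-load semantics described above.
 * Each of `rows` rows of `colsb` bytes is copied into the tile buffer,
 * and `base` advances by `stride` bytes between rows, so the source
 * bytes are NOT one contiguous range unless stride == colsb. */
static void model_tile_loadd(uint8_t *tile, const uint8_t *base,
                             int rows, int colsb, int stride) {
  for (int r = 0; r < rows; ++r) {
    memcpy(tile + r * colsb, base, colsb); /* copy one row */
    base += stride;                        /* skip the inter-row gap */
  }
}
```

With rows = 3, colsb = 4 and stride = 8, row r of the tile comes from source offset r * 8, not r * 4, i.e. half the bytes in the covered range are skipped.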

lebedev.ri requested changes to this revision.EditedMar 23 2021, 1:49 AM

@lebedev.ri, this patch is mainly for discussing the approach that Florian proposed, so I didn't polish my code. Nevertheless, your comments on amx_cast.c are right.

__tile_loadd() loads a 2D tile from memory; there is an extra stride parameter. As I explained on llvm-dev, it loads each row from memory into the tile register and then advances base += stride. So the data is not contiguous in memory.

Aha. Finally. So I was right and you are hiding a bug.
Bitcast is a red herring.
You should not be producing a plain load; that is a miscompile already.
You should ideally be producing LLVM IR's native tile load instruction (is there one? I thought there was one),
or introduce @llvm.x86.vector.amx.load and produce it instead.

Full stop. I'm going to block any further AMX patches until this is addressed.

This revision now requires changes to proceed.Mar 23 2021, 1:49 AM

@lebedev.ri, our goal is seeking an ideal solution, not arguing about who is right. I hope there is no bias in the discussion. I hope Florian and James set a role model for you: they are trying to understand the problem and helping to solve it. I don't know if it is right to block others' patches based on your own preference.

The load instruction loads contiguous bytes.
If that is not what AMX is trying to use it for, then it is being used incorrectly.

The load instruction loads contiguous bytes.
If that is not what AMX is trying to use it for, then it is being used incorrectly.

Isn't that the reason we can't cast "load <256 x i32>*" to "load x86_amx*"? The memory loaded for <256 x i32> is contiguous, while the memory loaded for x86_amx is not contiguous.

The load instruction loads contiguous bytes.
If that is not what AMX is trying to use it for, then it is being used incorrectly.

Isn't that the reason we can't cast "load <256 x i32>*" to "load x86_amx*"? The memory loaded for <256 x i32> is contiguous, while the memory loaded for x86_amx is not contiguous.

To be honest I don't really understand why the x86_amx type is even there.
It seems to me that if you just directly used @llvm.x86.tileloadd64.internal / @llvm.x86.tilestored64.internal,
and s/x86_amx/<256 x i32>/, none of these problems would be here.

To be honest i don't really understand why x86_amx type is even there.
It seems to me that if you just directly used @llvm.x86.tileloadd64.internal / @llvm.x86.tilestored64.internal,
and s/x86_amx/<256 x i32>/, none of these problems would be here.

I explained this on llvm-dev; I copy the content below.

Bitcasts are introduced when the frontend calls amx intrinsics. We use a vector to represent a 2D amx tile in the C language; on the other hand, we don't want to mix amx tiles with other vector operations, so x86_amx was introduced to isolate amx intrinsics from normal vector operations. The bitcast marks that a normal vector is passed to an amx intrinsic. In the example below, we need to transform the bitcast into a vector store and an amx load intrinsic. The x86_amx* pointer is unexpected at the beginning, but the middle-end generates it in the InstCombine pass.

define dso_local void @test_src_add(<256 x i32> %x, <256 x i32> %y, i16 %r, i16 %c, i8* %buf, i64 %s) {
; CHECK-LABEL: @test_src_add(
; CHECK-NEXT: entry:
; CHECK-NEXT: [[TMP0:%.*]] = alloca <256 x i32>, align 64
; CHECK-NEXT: [[ADD:%.*]] = add <256 x i32> [[Y:%.*]], [[X:%.*]]
; CHECK-NEXT: [[TMP1:%.*]] = bitcast <256 x i32>* [[TMP0]] to i8*
; CHECK-NEXT: store <256 x i32> [[ADD]], <256 x i32>* [[TMP0]], align 1024
; CHECK-NEXT: [[TMP2:%.*]] = call x86_amx @llvm.x86.tileloadd64.internal(i16 [[R:%.*]], i16 [[C:%.*]], i8* [[TMP1]], i64 64)
; CHECK-NEXT: call void @llvm.x86.tilestored64.internal(i16 [[R]], i16 [[C]], i8* [[BUF:%.*]], i64 [[S:%.*]], x86_amx [[TMP2]])
; CHECK-NEXT: ret void
;
entry:
  %add = add <256 x i32> %y, %x
  %t = bitcast <256 x i32> %add to x86_amx
  call void @llvm.x86.tilestored64.internal(i16 %r, i16 %c, i8* %buf, i64 %s, x86_amx %t)
  ret void
}

fhahn added a subscriber: fhahn.Mar 24 2021, 3:27 AM

To be honest i don't really understand why x86_amx type is even there.
It seems to me that if you just directly used @llvm.x86.tileloadd64.internal / @llvm.x86.tilestored64.internal,
and s/x86_amx/<256 x i32>/, none of these problems would be here.

I explained this on llvm-dev; I copy the content below.

Bitcasts are introduced when the frontend calls amx intrinsics. We use a vector to represent a 2D amx tile in the C language; on the other hand, we don't want to mix amx tiles with other vector operations, so x86_amx was introduced to isolate amx intrinsics from normal vector operations. The bitcast marks that a normal vector is passed to an amx intrinsic. In the example below, we need to transform the bitcast into a vector store and an amx load intrinsic. The x86_amx* pointer is unexpected at the beginning, but the middle-end generates it in the InstCombine pass.

entry:
  %add = add <256 x i32> %y, %x
  %t = bitcast <256 x i32> %add to x86_amx
  call void @llvm.x86.tilestored64.internal(i16 %r, i16 %c, i8* %buf, i64 %s, x86_amx %t)
  ret void

IIUC you need this to transfer/convert data from a consecutive vector to an AMX tile. To express that, emitting an intrinsic for the conversion instead of a bitcast seems the right thing to me.

IIUC Roman was saying that from that example alone it is not clear why the explicit conversion in IR is actually needed (please correct me if I am wrong). For the example, you *could* have a version of llvm.x86.tilestored64.internal that takes an <256 x i32> and does the conversion internally. Having a separate intrinsic to do the conversion gives greater composability in the IR, but I think at the moment it is hard to judge if that is needed, because it is not easy to get an overview of all AMX operations that need support. Is there a summary/documentation of the AMX builtins supported in Clang?

With respect to the load issue, it is not clear to me at the moment under which circumstances regular load instructions are generated & interact with AMX. If load is used to load x consecutive elements, then that's fine. But if the actual intended operation is a strided load, then load should not be used (this has also been discussed on llvm-dev).

IIUC you need this to transfer/convert data from a consecutive vector to an AMX tile. To express that, emitting an intrinsic for the conversion instead of a bitcast seems the right thing to me.

Yes. We need to transfer/convert data from a consecutive vector to an AMX tile, because in the C language interface the tile is defined as a vector: typedef int _tile1024i __attribute__((__vector_size__(1024), __aligned__(64)));. Take the code below (https://gcc.godbolt.org/z/noaWEWd6n) as an example.

#include <immintrin.h>

char buf[1024];
void foo() {
  _tile1024i tile;
  tile = __builtin_ia32_tileloadd64_internal(16, 64, buf, 64);
}

Compiling it with "clang -S -emit-llvm simple_amx.c -mamx-int8" we get the IR below.

define dso_local void @foo() #0 !dbg !15 {
  %1 = alloca <256 x i32>, align 64
  call void @llvm.dbg.declare(metadata <256 x i32>* %1, metadata !18, metadata !DIExpression()), !dbg !25
  %2 = call x86_amx @llvm.x86.tileloadd64.internal(i16 16, i16 64, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 64), !dbg !26
  %3 = bitcast x86_amx %2 to <256 x i32>, !dbg !26
  store <256 x i32> %3, <256 x i32>* %1, align 64, !dbg !27
  ret void, !dbg !28
}

The front-end allocas a <256 x i32> for the local variable tile. When the return value of __builtin_ia32_tileloadd64_internal is assigned to tile, the front-end bitcasts x86_amx to <256 x i32>. x86_amx is the type returned from __builtin_ia32_tileloadd64_internal.

IIUC Roman was saying that from that example alone it is not clear why the explicit conversion in IR is actually needed (please correct me if I am wrong). For the example, you *could* have a version of llvm.x86.tilestored64.internal that takes an <256 x i32> and does the conversion internally. Having a separate intrinsic to do the conversion gives greater composability in the IR, but I think at the moment it is hard to judge if that is needed, because it is not easy to get an overview of all AMX operations that need support. Is there a summary/documentation of the AMX builtins supported in Clang?

I plan to add the AMX operations to the Clang docs when the AMX support in LLVM is stable. There are only load/store, zero, and dot-product operations for AMX. We don't have full ISA support for matrix operations.

__builtin_ia32_tileloadd64_internal
__builtin_ia32_tdpbssd_internal
__builtin_ia32_tilestored64_internal
__builtin_ia32_tilezero_internal

With respect to the load issue, it is not clear to me at the moment under which circumstances regular load instructions are generated & interact with AMX. If load is used to load x consecutive elements, then that's fine. But if the actual intended operation is a strided load, then load should not be used (this has also been discussed on llvm-dev).

The load instructions are generated because the tile is a vector in the C language. See https://gcc.godbolt.org/z/qv5jnjK48. At -O0 a load instruction is generated; at -O2 the load instruction is eliminated. The -O2 version is what we want: there is no <256 x i32> in the generated code.

fhahn added a comment.Mar 24 2021, 7:44 AM

The front-end allocas a <256 x i32> for the local variable tile. When the return value of __builtin_ia32_tileloadd64_internal is assigned to tile, the front-end bitcasts x86_amx to <256 x i32>. x86_amx is the type returned from __builtin_ia32_tileloadd64_internal.

Can you share a more interesting example, where the result of the load is actually used by a different AMX builtin? For the store example, it seems like a conversion intrinsic + regular IR store should work.

With respect to the load issue, it is not clear to me at the moment under which circumstances regular load instructions are generated & interact with AMX. If load is used to load x consecutive elements, then that's fine. But if the actual intended operation is a strided load, then load should not be used (this has also been discussed on llvm-dev).

The load instructions are generated because the tile is a vector in the C language. See https://gcc.godbolt.org/z/qv5jnjK48. At -O0 a load instruction is generated; at -O2 the load instruction is eliminated. The -O2 version is what we want: there is no <256 x i32> in the generated code.

I can't see any load <256 x i32> in the linked example, just a store. Could you check the example?

LuoYuanke added a comment.EditedMar 24 2021, 7:41 PM

I can't see any load <256 x i32> in the linked example, just a store. Could you check the example?

I created another example at https://gcc.godbolt.org/z/v6od5ceEz. In the bar() function, you can see the load <256 x i32>* in the IR. The bar() function is actually buggy code, because tilec is not initialized by amx intrinsics. We want the user to call amx intrinsics to load/store tiles explicitly. Ideally the front-end could detect the issue and report an error.

fhahn added a comment.Mar 29 2021, 3:14 AM

I can't see any load <256 x i32> in the linked example, just a store. Could you check the example?

I created another example at https://gcc.godbolt.org/z/v6od5ceEz. In the bar() function, you can see the load <256 x i32>* in the IR. The bar() function is actually buggy code, because tilec is not initialized by amx intrinsics. We want the user to call amx intrinsics to load/store tiles explicitly. Ideally the front-end could detect the issue and report an error.

Thanks. AFAIK in those cases the conversion intrinsic makes sense to use, because you effectively need to convert between 2 types in a non-trivial way. @lebedev.ri WDYT?

I can't see any load <256 x i32> in the linked example, just a store. Could you check the example?

I created another example at https://gcc.godbolt.org/z/v6od5ceEz. In the bar() function, you can see the load <256 x i32>* in the IR. The bar() function is actually buggy code, because tilec is not initialized by amx intrinsics. We want the user to call amx intrinsics to load/store tiles explicitly. Ideally the front-end could detect the issue and report an error.

Thanks. AFAIK in those cases the conversion intrinsic makes sense to use, because you effectively need to convert between 2 types in a non-trivial way. @lebedev.ri WDYT?

I'm not sure. I think first and foremost the load/store miscompile should be addressed.
I think the rest is confusing because it seems to me that the only reason that bitcast
is needed is not a correctness reason, but to act as an opaque optimization barrier.

fhahn added a comment.Mar 29 2021, 3:44 AM

I can't see any load <256 x i32> in the linked example, just a store. Could you check the example?

I created another example at https://gcc.godbolt.org/z/v6od5ceEz. In the bar() function, you can see the load <256 x i32>* in the IR. The bar() function is actually buggy code, because tilec is not initialized by amx intrinsics. We want the user to call amx intrinsics to load/store tiles explicitly. Ideally the front-end could detect the issue and report an error.

Thanks. AFAIK in those cases the conversion intrinsic makes sense to use, because you effectively need to convert between 2 types in a non-trivial way. @lebedev.ri WDYT?

I'm not sure. I think first and foremost the load/store miscompile should be addressed.
I think the rest is confusing because it seems to me that the only reason that bitcast
is needed is not a correctness reason, but to act as an opaque optimization barrier.

I'm not sure if the loads and stores are actually incorrect. _tile1024i is defined as typedef int _tile1024i __attribute__((__vector_size__(1024), __aligned__(64))); and the loads/stores are for assignments/reads from variables that have that type, which is <256 x i32> in IR. So it is not obvious to me why the loads/stores would be wrong, as long as _tile1024i is defined as it is (if it were a different type, that all changes).

As a consequence, __builtin_ia32_tilezero_internal & the other builtins need to be defined as returning _tile1024i / <256 x i32>. I don't think there's any other way to specify this, unless you have a dedicated AMX type in the frontend. IIUC the current lowering is to call an intrinsic that returns x86_amx and then a bitcast is used for the conversion to the result type <256 x i32>, with the (incorrect) assumption that the bitcast does complex conversion between types. Another consequence of the builtins returning _tile1024i / <256 x i32> is that the conversion from the intrinsic result to <256 x i32> should happen at the place where Clang emits the call to the intrinsic directly, not patched up later as it is done now.

I can't see any load <256 x i32> in the linked example, just a store. Could you check the example?

I created another example at https://gcc.godbolt.org/z/v6od5ceEz. In the bar() function, you can see the load <256 x i32>* in the IR. The bar() function is actually buggy code, because tilec is not initialized by amx intrinsics. We want the user to call amx intrinsics to load/store tiles explicitly. Ideally the front-end could detect the issue and report an error.

Thanks. AFAIK in those cases the conversion intrinsic makes sense to use, because you effectively need to convert between 2 types in a non-trivial way. @lebedev.ri WDYT?

I'm not sure. I think first and foremost the load/store miscompile should be addressed.
I think the rest is confusing because it seems to me that the only reason that bitcast
is needed is not a correctness reason, but to act as an opaque optimization barrier.

I'm not sure if the loads and store are actually incorrect. _tile1024i is defined as typedef int _tile1024i __attribute__((__vector_size__(1024), __aligned__(64))); and the loads/stores are for assignments/reads from variables that have that type, which is <256 x i32> in IR. So it is not obvious to me why the loads/stores would be wrong, as long as _tile1024i is defined as it is (if it would be a different type, that all changes).

Didn't @LuoYuanke state:

@lebedev.ri, this patch is mainly for discussing the approach that Florian proposed, so I didn't polish my code. Nevertheless your comments for amx_cast.c is right. For __tile_loadd() is to load a 2d tile from memory. There is an extra parameter stride. As I explain in llvm-dev, it load each row from memory to tile register and then base += stride. So the data is not contiguous in memory.

Note the "So the data is not contiguous in memory."
I.e. we won't actually store to the i8* [ptr + 0, ptr + 4096) byte area.
Is plain load/store not defined to perform operations on contiguous chunks of memory?
Am I completely missing the point?

As a consequence, __builtin_ia32_tilezero_internal & the other builtins need to be defined as returning _tile1024i / <256 x i32>. I don't think there's any other way to specify this, unless you have a dedicated AMX type in the frontend. IIUC the current lowering is to call an intrinsic that returns x86_amx and then a bitcast is used for the conversion to the result type <256 x i32>, with the (incorrect) assumption that the bitcast does complex conversion between types. Another consequence of the builtins returning _tile1024i / <256 x i32> is that the conversion from the intrinsic result to <256 x i32> should happen at the place where Clang emits the call to the intrinsic directly, not patched up later as it is done now.

/store
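The contiguity argument above can be made concrete with a small sketch. `strided_span` is a hypothetical helper, not part of the patch: it computes the size of the memory region a strided row access covers, which exceeds the tile's payload whenever the stride is larger than a row.

```c
/* Hypothetical helper: the size in bytes of the memory span covered by
 * a strided access of `rows` rows of `colsb` bytes with row pitch
 * `stride` (offset one past the last byte touched, relative to base).
 * A plain vector load/store would instead touch exactly rows * colsb
 * contiguous bytes. */
static int strided_span(int rows, int colsb, int stride) {
  return (rows - 1) * stride + colsb;
}
```

For a 16-row tile of 64-byte rows (1024 bytes of payload, i.e. <256 x i32>), stride == 64 covers exactly 1024 contiguous bytes, while stride == 128 spreads the same payload over 1984 bytes with gaps between rows, which a plain load/store cannot model.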

fhahn added a comment.Mar 29 2021, 4:24 AM

snip

I'm not sure if the loads and store are actually incorrect. _tile1024i is defined as typedef int _tile1024i __attribute__((__vector_size__(1024), __aligned__(64))); and the loads/stores are for assignments/reads from variables that have that type, which is <256 x i32> in IR. So it is not obvious to me why the loads/stores would be wrong, as long as _tile1024i is defined as it is (if it would be a different type, that all changes).

Didn't @LuoYuanke state:

@lebedev.ri, this patch is mainly for discussing the approach that Florian proposed, so I didn't polish my code. Nevertheless your comments for amx_cast.c is right. For __tile_loadd() is to load a 2d tile from memory. There is an extra parameter stride. As I explain in llvm-dev, it load each row from memory to tile register and then base += stride. So the data is not contiguous in memory.

Note the "So the data is not contiguous in memory."
I.e. we won't actually store to the i8* [ptr + 0, ptr + 4096) byte area.
Is plain load/store not defined to perform operations on contiguous chunks of memory?
Am I completely missing the point?

I think that point was not really clear during the discussion. Using load <256 x i32> to lower __tile_loadd() would indeed be incorrect. But I don't think that's happening at the moment, at least going by a simple example https://gcc.godbolt.org/z/KT5rczn8j

void foo() {
  tilea = __builtin_ia32_tileloadd64_internal(16, 64, buf, 64);
}

is lowered to

define dso_local void @foo() #0 {
  %1 = call x86_amx @llvm.x86.tileloadd64.internal(i16 16, i16 64, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 64)
  %2 = bitcast x86_amx %1 to <256 x i32>
  store <256 x i32> %2, <256 x i32>* @tilea, align 64
  ret void
}

So we emit an intrinsic to do the strided load and the result is stored to contiguous memory, which is what the type _tile1024i requires. What's not modeled correctly is the conversion between the result of @llvm.x86.tileloadd64.internal and the store: it needs to be transferred in a flat vector.

Whether we should have x86_amx in the first place is a separate question, I think. Having a builtin type that does not work properly with fundamental instructions like load or store seems prone to errors (what instructions actually work with x86_amx? Do binary operators work?). Perhaps it would be possible and sufficient to have the intrinsics use an opaque type instead of a builtin type, like

%my_x86_amx = type opaque

define %my_x86_amx @foo(%my_x86_amx %x) {
  ret %my_x86_amx %x
}

But I think we should address those 2 issues separately and fix the biggest problem (misuse of bitcast) first, perhaps followed up by verifier rules rejecting x86_amx in unsuitable instructions, and go from there.

LuoYuanke edited subscribers, added: andrew.w.kaylor; removed: kbsmith1.

I think that point was not really clear during the discussion. Using load <256 x i32> to lower __tile_loadd() would indeed be incorrect. But I don't think that's happening at the moment, at least going from a simple example https://gcc.godbolt.org/z/KT5rczn8j

The load/store <256 x i32> is generated by the front-end, because in the C language the tile is a vector <256 x i32>. The load/store <256 x i32> is transformed to llvm.x86.tileloadd64.internal/llvm.x86.tilestored64.internal in lib/Target/X86/X86LowerAMXType.cpp if the load result is used as an operand of amx intrinsics, or the stored value is produced by amx intrinsics.

void foo() {
  tilea = __builtin_ia32_tileloadd64_internal(16, 64, buf, 64);
}

is lowered to

define dso_local void @foo() #0 {
  %1 = call x86_amx @llvm.x86.tileloadd64.internal(i16 16, i16 64, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 64)
  %2 = bitcast x86_amx %1 to <256 x i32>
  store <256 x i32> %2, <256 x i32>* @tilea, align 64
  ret void
}

So we emit an intrinsic to do the strided load and the result is stored to continuous memory, which is what the type _tile1024i requires. What's not modeled correctly is the conversion between the result of @llvm.x86.tileloadd64.internal and the store. It needs to be transferred in a flat vector.

Yes. I agree that it needs to be transferred in a flat vector.

Whether we should have x86_amx in the first place is a separate question I think. Having a builtin type that does not work properly with fundamental instructions like load or store seems prone for errors (what instructions actually work with x86_amx? Do binary operators work?). Perhaps it would be possible and sufficient to have the intrinsics use an opaque type instead of a builtin type, like

We only support tileload, tilestore, tilezero, tiletdp (dot product) instructions/intrinsics for x86_amx. Is there any example LLVM source code of an opaque type used with builtins? This example has some errors at https://gcc.godbolt.org/z/ar6WhjTMz.

%my_x86_amx = type opaque

define %my_x86_amx @foo(%my_x86_amx %x) {
  ret %my_x86_amx %x
}

But I think we should address those 2 issues separately and fix the biggest problem (mis-use of bitcast) first, perhaps followed up by verifier rules rejecting x86_amx from un-suitable instructions and go from there.

I may further implement this patch and transform/eliminate @llvm.x86.vector.amx.cast in lib/Target/X86/X86LowerAMXType.cpp, which runs before codegen. It takes some effort to implement, but I'd like to give it a try.

fhahn added a comment.Mar 29 2021, 7:04 AM

I think that point was not really clear during the discussion. Using load <256 x i32> to lower __tile_loadd() would indeed be incorrect. But I don't think that's happening at the moment, at least going from a simple example https://gcc.godbolt.org/z/KT5rczn8j

The load/store <256 x i32> is generated by the front-end, because in the C language the tile is a vector <256 x i32>. The load/store <256 x i32> is transformed to llvm.x86.tileloadd64.internal/llvm.x86.tilestored64.internal in lib/Target/X86/X86LowerAMXType.cpp if the load result is used as an operand of amx intrinsics, or the stored value is produced by amx intrinsics.

Sure, you can get rid of unnecessary conversions/loads/stores during optimizations, if you can operate on AMX tiles directly and do not need to store the intermediate results in flat vectors. You can also use strided loads/stores on contiguous memory (no gap between columns/rows). But whatever optimizations are applied later does not affect whether the IR emitted by Clang is correct. It always needs to be correct. Whether further optimizations are correct is a different problem, but we need a specification for the builtins, intrinsics and the type before going any further in that direction.

void foo() {
  tilea = __builtin_ia32_tileloadd64_internal(16, 64, buf, 64);
}

is lowered to

define dso_local void @foo() #0 {
  %1 = call x86_amx @llvm.x86.tileloadd64.internal(i16 16, i16 64, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 64)
  %2 = bitcast x86_amx %1 to <256 x i32>
  store <256 x i32> %2, <256 x i32>* @tilea, align 64
  ret void
}

So we emit an intrinsic to do the strided load and the result is stored to continuous memory, which is what the type _tile1024i requires. What's not modeled correctly is the conversion between the result of @llvm.x86.tileloadd64.internal and the store. It needs to be transferred in a flat vector.

Yes. I agree that it needs to be transferred in a flat vector.

Whether we should have x86_amx in the first place is a separate question I think. Having a builtin type that does not work properly with fundamental instructions like load or store seems prone for errors (what instructions actually work with x86_amx? Do binary operators work?). Perhaps it would be possible and sufficient to have the intrinsics use an opaque type instead of a builtin type, like

We only support tileload, tilestore, tilezero, tiletdp (dot product) instructions/intrinsics for x86_amx. Is there any example LLVM source code of an opaque type used with builtins? This example has some errors at https://gcc.godbolt.org/z/ar6WhjTMz.

I think you need to set the input to LLVM IR: https://gcc.godbolt.org/z/WexMjsas9

You should be able to use opaque types with overloaded intrinsics. I don't think you can define an intrinsic to take a specific opaque type (because it's not known up front).

Whether further optimizations are correct is a different problem, but we need a specification for the builtins, intrinsics and the type before going any further in that direction.

I think you need to set the input to LLVM IR: https://gcc.godbolt.org/z/WexMjsas9

You should be able to use opaque types with overloaded intrinsics. I don't think you define an intrinsic to take a specific opaque type (because it's not known up front).

The opaque type (https://llvm.org/docs/LangRef.html#opaque-structure-types) is pretty new to me. I didn't find any example of an opaque type in builtins or intrinsics. I would appreciate it if you could write some example code (maybe tilezero) for the builtins, intrinsics and the type, so that I can understand it well. If we use <256 x i32> in builtins, use x86_amx in intrinsics, and have a specific intrinsic to convert x86_amx to a flat vector <256 x i32>, I am able to do it myself.

fhahn added a comment.Mar 31 2021, 1:20 AM

Whether to further optimizations are correct is a different problem, but we need a specification for the builtins, intrinsics and the type before going any further in that direction.

I think you need to set the input to LLVM IR: https://gcc.godbolt.org/z/WexMjsas9

You should be able to use opaque types with overloaded intrinsics. I don't think you define an intrinsic to take a specific opaque type (because it's not known up front).

The opaque type (https://llvm.org/docs/LangRef.html#opaque-structure-types) is pretty new to me. I didn't find any example of an opaque type in builtins or intrinsics. I would appreciate it if you could write some example code (maybe tilezero) for the builtins, intrinsics and the type, so that I can understand it well. If we use <256 x i32> in builtins, use x86_amx in intrinsics, and have a specific intrinsic to convert x86_amx to a flat vector <256 x i32>, I am able to do it myself.

Unfortunately it is not possible to use an opaque type with the AMX intrinsics at the moment, because of the way they are defined. It is possible to use opaque types with intrinsics in general though, e.g. see https://llvm.godbolt.org/z/Ezhf6535c

My point is, you should be able to adjust the definitions of the AMX intrinsics and then just replace all occurrences of x86_amx in your examples with an opaque type you define in the module. But as I said initially, you don't need to do everything at once (and you probably shouldn't). I'd start with addressing the bitcast issue and tackle the x86_amx type itself once that is done.

(And I am also not saying that it definitely needs to be removed, only that if it should be kept in the long run, it would be good to specify it in the LangRef, and it should have a good justification, especially if there are no instructions that do anything meaningful with values of the type other than take them as arguments and return values. Opaque types are a suggestion for an alternative that *may* be viable without a dedicated first-class type.)

Unfortunately it is not possible to use an opaque type with the AMX intrinsics at the moment, because of the way they are defined. It is possible to use opaque types with intrinsics in general though, e.g. see https://llvm.godbolt.org/z/Ezhf6535c

My point is, you should be able to adjust the definitions of the AMX intrinsics and then just replace all occurrences of x86_amx in your examples with a opaque type you define in the module. But as I said initially, you don't need to do everything at once (and you probably shouldn't). I'd start with addressing the bitcast issue and tackle the x86_amx type itself once that is done.

(And I am also not saying that it definitely needs to be removed, only that if it should be kept in the long run, it would be good to specify it in the LangRef and should have a good justification, especially if there are no instructions that do anything meaningful with values of the type other than take it as arguments and return values. Opaque types are a suggestion for an alternative that *may* be viable without a dedicated first-class type)

Thank you for the suggestion. So here is my plan.

  1. Specify x86_amx in the LangRef.
  2. Add an llvm.x86.tile.cast intrinsic.
  3. Optimize llvm.x86.tile.cast in some of the cases where bitcast is optimized today, and transform llvm.x86.tile.cast to amx intrinsics when it can't be eliminated.
  4. After the above 3 items are finished, replace bitcast with llvm.x86.tile.cast in the front-end when generating IR for amx builtins.
  5. After some time for stabilization, remove the bitcast transform code from LLVM.
  6. After all of the llvm.x86.tile.cast work is finished, let's discuss the opaque type.

Does that look good to you?
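As a sketch of steps 2 and 4 above, the front-end would emit a conversion intrinsic where it emits a bitcast today. The spelling below reuses the cast intrinsic name from this patch's summary; the final name and type mangling may differ:

```llvm
; today: the front-end models the conversion as a bitcast
%v = bitcast x86_amx %tile to <256 x i32>

; planned: an explicit conversion intrinsic instead
%v2 = call <256 x i32> @llvm.x86.vector.amx.cast.v256i32.x86amx(x86_amx %tile)
```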

fhahn added a comment.Mar 31 2021, 1:56 PM

Unfortunately this is not possible to use an opaque type with the AMX intrinsics at the moment, because of the way they are define. It is possible to use opaque types with intrinsics in general though, e.g. see https://llvm.godbolt.org/z/Ezhf6535c

My point is, you should be able to adjust the definitions of the AMX intrinsics and then just replace all occurrences of x86_amx in your examples with a opaque type you define in the module. But as I said initially, you don't need to do everything at once (and you probably shouldn't). I'd start with addressing the bitcast issue and tackle the x86_amx type itself once that is done.

(And I am also not saying that it definitely needs to be removed, only that if it should be kept in the long run, it would be good to specify it in the LangRef and should have a good justification, especially if there are no instructions that do anything meaningful with values of the type other than take it as arguments and return values. Opaque types are a suggestion for an alternative that *may* be viable without a dedicated first-class type)

Thank you for the suggestion. So here is my plan.

  1. Specify x86_amx in the LangRef.
  2. Add an llvm.x86.tile.cast intrinsic.
  3. Optimize llvm.x86.tile.cast in some of the cases where bitcast is optimized today, and transform llvm.x86.tile.cast to amx intrinsics when it can't be eliminated.
  4. After the above 3 items are finished, replace bitcast with llvm.x86.tile.cast in the front-end when generating IR for amx builtins.
  5. After some time for stabilization, remove the bitcast transform code from LLVM.
  6. After all of the llvm.x86.tile.cast work is finished, let's discuss the opaque type.

Does that look good to you?

Sounds good to me, but it might be good to also share this in the thread on llvm-dev, as there might be additional feedback :)