
[AMX] Prototype for vector and amx bitcast.
Needs Revision · Public

Authored by LuoYuanke on Mar 23 2021, 1:16 AM.

Details

Reviewers
lebedev.ri
Summary

Introduce a new intrinsic to cast between vector and amx types. This
prevents middle-end optimization on the bitcast. However, sometimes we
need that optimization for bitcast: for inner_product in amx_cast.c, we
have to handle llvm.x86.vector.amx.cast.v256i32.x86amx ourselves.

Diff Detail

Unit Tests: Failed

Time    Test
520 ms  x64 debian > Clang.CodeGen/X86::amx_api.c
Script: -- : 'RUN: at line 1'; /mnt/disks/ssd0/agent/llvm-project/build/bin/clang -cc1 -internal-isystem /mnt/disks/ssd0/agent/llvm-project/build/lib/clang/13.0.0/include -nostdsysteminc /mnt/disks/ssd0/agent/llvm-project/clang/test/CodeGen/X86/amx_api.c -flax-vector-conversions=none -ffreestanding -triple=x86_64-unknown-unknown -target-feature +avx512f -target-feature +amx-int8 -target-feature +amx-bf16 -emit-llvm -o - -Werror -pedantic | /mnt/disks/ssd0/agent/llvm-project/build/bin/FileCheck /mnt/disks/ssd0/agent/llvm-project/clang/test/CodeGen/X86/amx_api.c --check-prefixes=CHECK
830 ms  x64 windows > Clang.CodeGen/X86::amx_api.c
Script: -- : 'RUN: at line 1'; c:\ws\w16n2-1\llvm-project\premerge-checks\build\bin\clang.exe -cc1 -internal-isystem c:\ws\w16n2-1\llvm-project\premerge-checks\build\lib\clang\13.0.0\include -nostdsysteminc C:\ws\w16n2-1\llvm-project\premerge-checks\clang\test\CodeGen\X86\amx_api.c -flax-vector-conversions=none -ffreestanding -triple=x86_64-unknown-unknown -target-feature +avx512f -target-feature +amx-int8 -target-feature +amx-bf16 -emit-llvm -o - -Werror -pedantic | c:\ws\w16n2-1\llvm-project\premerge-checks\build\bin\filecheck.exe C:\ws\w16n2-1\llvm-project\premerge-checks\clang\test\CodeGen\X86\amx_api.c --check-prefixes=CHECK

Event Timeline

LuoYuanke created this revision.Mar 23 2021, 1:16 AM
LuoYuanke requested review of this revision.Mar 23 2021, 1:16 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptMar 23 2021, 1:16 AM

I'm a little bit lost with all this AMX stuff.
Could you please explain, in normal human words, what __tile_loadd() does?
I.e. given

void wrapper(__tile& dst, const void *base, int stride) {
  _tile_loadd(dst, base, stride);
}

which bytes from base will be loaded?

clang/test/CodeGen/X86/amx_cast.c
1

Please don't use llvm optimizations in clang tests.

4

The tests should be hermetic.
Which immintrin.h is being used there?
Hopefully the one from clang, not the one from system?

@lebedev.ri, this patch is mainly for discussing the approach that Florian proposed, so I didn't polish my code. Nevertheless, your comments on amx_cast.c are right. __tile_loadd() loads a 2D tile from memory; there is an extra stride parameter. As I explained on llvm-dev, it loads each row from memory into the tile register and then advances base += stride. So the data is not contiguous in memory.
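The row-by-row, strided semantics described above can be sketched in plain C. This is a minimal model, not the real AMX intrinsic: `model_tile_loadd` and its parameter names are illustrative assumptions.

```c
#include <stdint.h>
#include <string.h>

/* Minimal model of the strided tile-load semantics described above.
 * Each of `rows` rows of `colsb` bytes is copied into the tile buffer,
 * and `base` advances by `stride` bytes between rows, so the source
 * bytes are NOT one contiguous range unless stride == colsb. */
static void model_tile_loadd(uint8_t *tile, const uint8_t *base,
                             int rows, int colsb, int stride) {
  for (int r = 0; r < rows; ++r) {
    memcpy(tile + r * colsb, base, colsb); /* copy one row */
    base += stride;                        /* skip the inter-row gap */
  }
}
```

With rows = 3, colsb = 4 and stride = 8, row r of the tile comes from source offset r * 8, not r * 4, i.e. half the bytes in the covered range are skipped.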

lebedev.ri requested changes to this revision.EditedMar 23 2021, 1:49 AM

@lebedev.ri, this patch is mainly for discussing the approach that Florian proposed, so I didn't polish my code. Nevertheless, your comments on amx_cast.c are right.

__tile_loadd() loads a 2D tile from memory; there is an extra stride parameter. As I explained on llvm-dev, it loads each row from memory into the tile register and then advances base += stride. So the data is not contiguous in memory.

Aha. Finally. So I was right and you are hiding a bug.
Bitcast is a red herring.
You should not be producing a plain load; that is a miscompile already.
You should ideally be producing LLVM IR's native tile load instruction (is there one? I thought there was one),
or introduce @llvm.x86.vector.amx.load and produce it instead.

Full stop. I'm going to block any further AMX patches until this is addressed.

This revision now requires changes to proceed.Mar 23 2021, 1:49 AM

@lebedev.ri, our goal is seeking an ideal solution, not arguing about who is right. I hope there is no bias in the discussion. I hope Florian and James set a role model for you: they are trying to understand the problem and helping to solve it. I don't know if it is right to block others' patches based on your own preference.

The load instruction loads contiguous bytes.
If that is not what AMX is trying to use it for, then it is being used incorrectly.

The load instruction loads contiguous bytes.
If that is not what AMX is trying to use it for, then it is being used incorrectly.

Isn't that the reason we can't cast "load <256 x i32>*" to "load x86_amx*"? The memory loaded for <256 x i32> is contiguous, while the memory loaded for x86_amx is not contiguous.

The load instruction loads contiguous bytes.
If that is not what AMX is trying to use it for, then it is being used incorrectly.

Isn't that the reason we can't cast "load <256 x i32>*" to "load x86_amx*"? The memory loaded for <256 x i32> is contiguous, while the memory loaded for x86_amx is not contiguous.

To be honest I don't really understand why the x86_amx type is even there.
It seems to me that if you just directly used @llvm.x86.tileloadd64.internal / @llvm.x86.tilestored64.internal,
and s/x86_amx/<256 x i32>/, none of these problems would be here.

To be honest i don't really understand why x86_amx type is even there.
It seems to me that if you just directly used @llvm.x86.tileloadd64.internal / @llvm.x86.tilestored64.internal,
and s/x86_amx/<256 x i32>/, none of these problems would be here.

I explained this on llvm-dev; I copy the content below.

Bitcasts are introduced when the frontend calls amx intrinsics. We use a vector to represent a 2D amx tile in the C language; on the other hand, we don't want to mix amx tiles with other vector operations, so x86_amx was introduced to isolate amx intrinsics from normal vector operations. The bitcast marks that a normal vector is passed to an amx intrinsic. In the example below, we need to transform the bitcast into a vector store and an amx load intrinsic. The x86_amx* pointer is unexpected at the beginning, but the middle-end generates it in the InstCombine pass.

define dso_local void @test_src_add(<256 x i32> %x, <256 x i32> %y, i16 %r, i16 %c, i8* %buf, i64 %s) {
; CHECK-LABEL: @test_src_add(
; CHECK-NEXT: entry:
; CHECK-NEXT: [[TMP0:%.*]] = alloca <256 x i32>, align 64
; CHECK-NEXT: [[ADD:%.*]] = add <256 x i32> [[Y:%.*]], [[X:%.*]]
; CHECK-NEXT: [[TMP1:%.*]] = bitcast <256 x i32>* [[TMP0]] to i8*
; CHECK-NEXT: store <256 x i32> [[ADD]], <256 x i32>* [[TMP0]], align 1024
; CHECK-NEXT: [[TMP2:%.*]] = call x86_amx @llvm.x86.tileloadd64.internal(i16 [[R:%.*]], i16 [[C:%.*]], i8* [[TMP1]], i64 64)
; CHECK-NEXT: call void @llvm.x86.tilestored64.internal(i16 [[R]], i16 [[C]], i8* [[BUF:%.*]], i64 [[S:%.*]], x86_amx [[TMP2]])
; CHECK-NEXT: ret void
;
entry:
  %add = add <256 x i32> %y, %x
  %t = bitcast <256 x i32> %add to x86_amx
  call void @llvm.x86.tilestored64.internal(i16 %r, i16 %c, i8* %buf, i64 %s, x86_amx %t)
  ret void
}

fhahn added a subscriber: fhahn.Mar 24 2021, 3:27 AM

To be honest i don't really understand why x86_amx type is even there.
It seems to me that if you just directly used @llvm.x86.tileloadd64.internal / @llvm.x86.tilestored64.internal,
and s/x86_amx/<256 x i32>/, none of these problems would be here.

I explained this on llvm-dev; I copy the content below.

Bitcasts are introduced when the frontend calls amx intrinsics. We use a vector to represent a 2D amx tile in the C language; on the other hand, we don't want to mix amx tiles with other vector operations, so x86_amx was introduced to isolate amx intrinsics from normal vector operations. The bitcast marks that a normal vector is passed to an amx intrinsic. In the example below, we need to transform the bitcast into a vector store and an amx load intrinsic. The x86_amx* pointer is unexpected at the beginning, but the middle-end generates it in the InstCombine pass.

entry:
  %add = add <256 x i32> %y, %x
  %t = bitcast <256 x i32> %add to x86_amx
  call void @llvm.x86.tilestored64.internal(i16 %r, i16 %c, i8* %buf, i64 %s, x86_amx %t)
  ret void

IIUC you need this to transfer/convert data from a consecutive vector to an AMX tile. To express that, emitting an intrinsic for the conversion instead of a bitcast seems the right thing to me.

IIUC Roman was saying that from that example alone it is not clear why the explicit conversion in IR is actually needed (please correct me if I am wrong). For the example, you *could* have a version of llvm.x86.tilestored64.internal that takes an <256 x i32> and does the conversion internally. Having a separate intrinsic to do the conversion gives greater composability in the IR, but I think at the moment it is hard to judge if that is needed, because it is not easy to get an overview of all AMX operations that need support. Is there a summary/documentation of the AMX builtins supported in Clang?

With respect to the load issue, it is not clear to me at the moment under which circumstances regular load instructions are generated & interact with AMX. If load is used to load x consecutive elements, then that's fine. But if the actual intended operation is a strided load, then load should not be used (this has also been discussed on llvm-dev).

IIUC you need this to transfer/convert data from a consecutive vector to an AMX tile. To express that, emitting an intrinsic for the conversion instead of a bitcast seems the right thing to me.

Yes. We need to transfer/convert data from a consecutive vector to an AMX tile, because in the C language interface the tile is defined as a vector: typedef int _tile1024i __attribute__((__vector_size__(1024), __aligned__(64)));. Take the code below (https://gcc.godbolt.org/z/noaWEWd6n) as an example.

#include <immintrin.h>

char buf[1024];
void foo() {
  _tile1024i tile;
  tile = __builtin_ia32_tileloadd64_internal(16, 64, buf, 64);
}

Compiling it with "clang -S -emit-llvm simple_amx.c -mamx-int8" we get the IR below.

define dso_local void @foo() #0 !dbg !15 {
  %1 = alloca <256 x i32>, align 64
  call void @llvm.dbg.declare(metadata <256 x i32>* %1, metadata !18, metadata !DIExpression()), !dbg !25
  %2 = call x86_amx @llvm.x86.tileloadd64.internal(i16 16, i16 64, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 64), !dbg !26
  %3 = bitcast x86_amx %2 to <256 x i32>, !dbg !26
  store <256 x i32> %3, <256 x i32>* %1, align 64, !dbg !27
  ret void, !dbg !28
}

The front-end allocas a <256 x i32> for the local variable tile. When the return value of __builtin_ia32_tileloadd64_internal is assigned to tile, the front-end bitcasts x86_amx to <256 x i32>. x86_amx is the type returned from __builtin_ia32_tileloadd64_internal.

IIUC Roman was saying that from that example alone it is not clear why the explicit conversion in IR is actually needed (please correct me if I am wrong). For the example, you *could* have a version of llvm.x86.tilestored64.internal that takes an <256 x i32> and does the conversion internally. Having a separate intrinsic to do the conversion gives greater composability in the IR, but I think at the moment it is hard to judge if that is needed, because it is not easy to get an overview of all AMX operations that need support. Is there a summary/documentation of the AMX builtins supported in Clang?

I plan to add the AMX operations to the Clang docs when the AMX support in LLVM is stable. There are only load/store, zero, and dot-product operations for AMX. We don't have full ISA support for matrix operations.

__builtin_ia32_tileloadd64_internal
__builtin_ia32_tdpbssd_internal
__builtin_ia32_tilestored64_internal
__builtin_ia32_tilezero_internal

With respect to the load issue, it is not clear to me at the moment under which circumstances regular load instructions are generated & interact with AMX. If load is used to load x consecutive elements, then that's fine. But if the actual intended operation is a strided load, then load should not be used (this has also been discussed on llvm-dev).

The load instructions are generated because the tile is a vector in the C language. See https://gcc.godbolt.org/z/qv5jnjK48. At -O0 a load instruction is generated; at -O2 the load instruction is eliminated. The -O2 version is what we want: there is no <256 x i32> in the generated code.

fhahn added a comment.Mar 24 2021, 7:44 AM

The front-end allocas a <256 x i32> for the local variable tile. When the return value of __builtin_ia32_tileloadd64_internal is assigned to tile, the front-end bitcasts x86_amx to <256 x i32>. x86_amx is the type returned from __builtin_ia32_tileloadd64_internal.

Can you share a more interesting example, where the result of the load is actually used by a different AMX builtin? For the store example, it seems like a conversion intrinsic + regular IR store should work.

With respect to the load issue, it is not clear to me at the moment under which circumstances regular load instructions are generated & interact with AMX. If load is used to load x consecutive elements, then that's fine. But if the actual intended operation is a strided load, then load should not be used (this has also been discussed on llvm-dev).

The load instructions are generated because the tile is a vector in the C language. See https://gcc.godbolt.org/z/qv5jnjK48. At -O0 a load instruction is generated; at -O2 the load instruction is eliminated. The -O2 version is what we want: there is no <256 x i32> in the generated code.

I can't see any load <256 x i32> in the linked example, just a store. Could you check the example?

LuoYuanke added a comment.EditedMar 24 2021, 7:41 PM

I can't see any load <256 x i32> in the linked example, just a store. Could you check the example?

I created another example at https://gcc.godbolt.org/z/v6od5ceEz. In the bar() function, you can see the load <256 x i32>* in the IR. The bar() function is actually buggy code, because tilec is not initialized by amx intrinsics. We want the user to call amx intrinsics to load/store tiles explicitly. Ideally the front-end could detect the issue and report an error.

fhahn added a comment.Mar 29 2021, 3:14 AM

I can't see any load <256 x i32> in the linked example, just a store. Could you check the example?

I created another example at https://gcc.godbolt.org/z/v6od5ceEz. In the bar() function, you can see the load <256 x i32>* in the IR. The bar() function is actually buggy code, because tilec is not initialized by amx intrinsics. We want the user to call amx intrinsics to load/store tiles explicitly. Ideally the front-end could detect the issue and report an error.

Thanks. AFAIK in those cases the conversion intrinsic makes sense to use, because you effectively need to convert between 2 types in a non-trivial way. @lebedev.ri WDYT?

I can't see any load <256 x i32> in the linked example, just a store. Could you check the example?

I created another example at https://gcc.godbolt.org/z/v6od5ceEz. In the bar() function, you can see the load <256 x i32>* in the IR. The bar() function is actually buggy code, because tilec is not initialized by amx intrinsics. We want the user to call amx intrinsics to load/store tiles explicitly. Ideally the front-end could detect the issue and report an error.

Thanks. AFAIK in those cases the conversion intrinsic makes sense to use, because you effectively need to convert between 2 types in a non-trivial way. @lebedev.ri WDYT?

I'm not sure. I think first and foremost the load/store miscompile should be addressed.
I think the rest is confusing because it seems to me that the only reason that bitcast
is needed is not a correctness reason, but to act as an opaque optimization barrier.

fhahn added a comment.Mar 29 2021, 3:44 AM

I can't see any load <256 x i32> in the linked example, just a store. Could you check the example?

I created another example at https://gcc.godbolt.org/z/v6od5ceEz. In the bar() function, you can see the load <256 x i32>* in the IR. The bar() function is actually buggy code, because tilec is not initialized by amx intrinsics. We want the user to call amx intrinsics to load/store tiles explicitly. Ideally the front-end could detect the issue and report an error.

Thanks. AFAIK in those cases the conversion intrinsic makes sense to use, because you effectively need to convert between 2 types in a non-trivial way. @lebedev.ri WDYT?

I'm not sure. I think first and foremost the load/store miscompile should be addressed.
I think the rest is confusing because it seems to me that the only reason that bitcast
is needed is not a correctness reason, but to act as an opaque optimization barrier.

I'm not sure if the loads and stores are actually incorrect. _tile1024i is defined as typedef int _tile1024i __attribute__((__vector_size__(1024), __aligned__(64))); and the loads/stores are for assignments/reads from variables that have that type, which is <256 x i32> in IR. So it is not obvious to me why the loads/stores would be wrong, as long as _tile1024i is defined as it is (if it were a different type, that all changes).

As a consequence, __builtin_ia32_tilezero_internal & the other builtins need to be defined as returning _tile1024i / <256 x i32>. I don't think there's any other way to specify this, unless you have a dedicated AMX type in the frontend. IIUC the current lowering is to call an intrinsic that returns x86_amx and then a bitcast is used for the conversion to the result type <256 x i32>, with the (incorrect) assumption that the bitcast does complex conversion between types. Another consequence of the builtins returning _tile1024i / <256 x i32> is that the conversion from the intrinsic result to <256 x i32> should happen at the place where Clang emits the call to the intrinsic directly, not patched up later as it is done now.

I can't see any load <256 x i32> in the linked example, just a store. Could you check the example?

I created another example at https://gcc.godbolt.org/z/v6od5ceEz. In the bar() function, you can see the load <256 x i32>* in the IR. The bar() function is actually buggy code, because tilec is not initialized by amx intrinsics. We want the user to call amx intrinsics to load/store tiles explicitly. Ideally the front-end could detect the issue and report an error.

Thanks. AFAIK in those cases the conversion intrinsic makes sense to use, because you effectively need to convert between 2 types in a non-trivial way. @lebedev.ri WDYT?

I'm not sure. I think first and foremost the load/store miscompile should be addressed.
I think the rest is confusing because it seems to me that the only reason that bitcast
is needed is not a correctness reason, but to act as an opaque optimization barrier.

I'm not sure if the loads and store are actually incorrect. _tile1024i is defined as typedef int _tile1024i __attribute__((__vector_size__(1024), __aligned__(64))); and the loads/stores are for assignments/reads from variables that have that type, which is <256 x i32> in IR. So it is not obvious to me why the loads/stores would be wrong, as long as _tile1024i is defined as it is (if it would be a different type, that all changes).

Didn't @LuoYuanke state:

@lebedev.ri, this patch is mainly for discussing the approach that Florian proposed, so I didn't polish my code. Nevertheless your comments for amx_cast.c is right. For __tile_loadd() is to load a 2d tile from memory. There is an extra parameter stride. As I explain in llvm-dev, it load each row from memory to tile register and then base += stride. So the data is not contiguous in memory.

Note the "So the data is not contiguous in memory."
I.e. we won't actually store to the i8* [ptr + 0, ptr + 4096) byte area.
Is plain load/store not defined to perform operations on contiguous chunks of memory?
Am I completely missing the point?

As a consequence, __builtin_ia32_tilezero_internal & the other builtins need to be defined as returning _tile1024i / <256 x i32>. I don't think there's any other way to specify this, unless you have a dedicated AMX type in the frontend. IIUC the current lowering is to call an intrinsic that returns x86_amx and then a bitcast is used for the conversion to the result type <256 x i32>, with the (incorrect) assumption that the bitcast does complex conversion between types. Another consequence of the builtins returning _tile1024i / <256 x i32> is that the conversion from the intrinsic result to <256 x i32> should happen at the place where Clang emits the call to the intrinsic directly, not patched up later as it is done now.

/store
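The contiguity argument above can be made concrete with a small sketch. `strided_span` is a hypothetical helper, not part of the patch: it computes the size of the memory region a strided row access covers, which exceeds the tile's payload whenever the stride is larger than a row.

```c
/* Hypothetical helper: the size in bytes of the memory span covered by
 * a strided access of `rows` rows of `colsb` bytes with row pitch
 * `stride` (offset one past the last byte touched, relative to base).
 * A plain vector load/store would instead touch exactly rows * colsb
 * contiguous bytes. */
static int strided_span(int rows, int colsb, int stride) {
  return (rows - 1) * stride + colsb;
}
```

For a 16-row tile of 64-byte rows (1024 bytes of payload, i.e. <256 x i32>), stride == 64 covers exactly 1024 contiguous bytes, while stride == 128 spreads the same payload over 1984 bytes with gaps between rows, which a plain load/store cannot model.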

fhahn added a comment.Mar 29 2021, 4:24 AM

snip

I'm not sure if the loads and store are actually incorrect. _tile1024i is defined as typedef int _tile1024i __attribute__((__vector_size__(1024), __aligned__(64))); and the loads/stores are for assignments/reads from variables that have that type, which is <256 x i32> in IR. So it is not obvious to me why the loads/stores would be wrong, as long as _tile1024i is defined as it is (if it would be a different type, that all changes).

Didn't @LuoYuanke state:

@lebedev.ri, this patch is mainly for discussing the approach that Florian proposed, so I didn't polish my code. Nevertheless your comments for amx_cast.c is right. For __tile_loadd() is to load a 2d tile from memory. There is an extra parameter stride. As I explain in llvm-dev, it load each row from memory to tile register and then base += stride. So the data is not contiguous in memory.

Note the "So the data is not contiguous in memory."
I.e. we won't actually store to the i8* [ptr + 0, ptr + 4096) byte area.
Is plain load/store not defined to perform operations on contiguous chunks of memory?
Am I completely missing the point?

I think that point was not really clear during the discussion. Using load <256 x i32> to lower __tile_loadd() would indeed be incorrect. But I don't think that's happening at the moment, at least going by a simple example https://gcc.godbolt.org/z/KT5rczn8j

void foo() {
  tilea = __builtin_ia32_tileloadd64_internal(16, 64, buf, 64);
}

is lowered to

define dso_local void @foo() #0 {
  %1 = call x86_amx @llvm.x86.tileloadd64.internal(i16 16, i16 64, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 64)
  %2 = bitcast x86_amx %1 to <256 x i32>
  store <256 x i32> %2, <256 x i32>* @tilea, align 64
  ret void
}

So we emit an intrinsic to do the strided load and the result is stored to contiguous memory, which is what the type _tile1024i requires. What's not modeled correctly is the conversion between the result of @llvm.x86.tileloadd64.internal and the store: it needs to be transferred in a flat vector.

Whether we should have x86_amx in the first place is a separate question, I think. Having a builtin type that does not work properly with fundamental instructions like load or store seems prone to errors (what instructions actually work with x86_amx? Do binary operators work?). Perhaps it would be possible and sufficient to have the intrinsics use an opaque type instead of a builtin type, like

%my_x86_amx = type opaque

define %my_x86_amx @foo(%my_x86_amx %x) {
  ret %my_x86_amx %x
}

But I think we should address those 2 issues separately and fix the biggest problem (misuse of bitcast) first, perhaps followed up by verifier rules rejecting x86_amx in unsuitable instructions, and go from there.

LuoYuanke edited subscribers, added: andrew.w.kaylor; removed: kbsmith1.

I think that point was not really clear during the discussion. Using load <256 x i32> to lower __tile_loadd() would indeed be incorrect. But I don't think that's happening at the moment, at least going from a simple example https://gcc.godbolt.org/z/KT5rczn8j

The load/store <256 x i32> is generated by the front-end, because in the C language the tile is a vector <256 x i32>. The load/store <256 x i32> is transformed to llvm.x86.tileloadd64.internal/llvm.x86.tilestored64.internal in lib/Target/X86/X86LowerAMXType.cpp if the load result is used as an operand of amx intrinsics, or the stored value is produced by amx intrinsics.

void foo() {
  tilea = __builtin_ia32_tileloadd64_internal(16, 64, buf, 64);
}

is lowered to

define dso_local void @foo() #0 {
  %1 = call x86_amx @llvm.x86.tileloadd64.internal(i16 16, i16 64, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 64)
  %2 = bitcast x86_amx %1 to <256 x i32>
  store <256 x i32> %2, <256 x i32>* @tilea, align 64
  ret void
}

So we emit an intrinsic to do the strided load and the result is stored to continuous memory, which is what the type _tile1024i requires. What's not modeled correctly is the conversion between the result of @llvm.x86.tileloadd64.internal and the store. It needs to be transferred in a flat vector.

Yes. I agree that it needs to be transferred in a flat vector.

Whether we should have x86_amx in the first place is a separate question I think. Having a builtin type that does not work properly with fundamental instructions like load or store seems prone for errors (what instructions actually work with x86_amx? Do binary operators work?). Perhaps it would be possible and sufficient to have the intrinsics use an opaque type instead of a builtin type, like

We only support tileload, tilestore, tilezero, tiletdp (dot product) instructions/intrinsics for x86_amx. Is there any example LLVM source code of an opaque type used with builtins? This example has some errors at https://gcc.godbolt.org/z/ar6WhjTMz.

%my_x86_amx = type opaque

define %my_x86_amx @foo(%my_x86_amx %x) {
  ret %my_x86_amx %x
}

But I think we should address those 2 issues separately and fix the biggest problem (mis-use of bitcast) first, perhaps followed up by verifier rules rejecting x86_amx from un-suitable instructions and go from there.

I may further implement this patch and transform/eliminate @llvm.x86.vector.amx.cast in lib/Target/X86/X86LowerAMXType.cpp, which runs before codegen. It takes some effort to implement, but I'd like to give it a try.

fhahn added a comment.Mar 29 2021, 7:04 AM

I think that point was not really clear during the discussion. Using load <256 x i32> to lower __tile_loadd() would indeed be incorrect. But I don't think that's happening at the moment, at least going from a simple example https://gcc.godbolt.org/z/KT5rczn8j

The load/store <256 x i32> is generated by the front-end, because in the C language the tile is a vector <256 x i32>. The load/store <256 x i32> is transformed to llvm.x86.tileloadd64.internal/llvm.x86.tilestored64.internal in lib/Target/X86/X86LowerAMXType.cpp if the load result is used as an operand of amx intrinsics, or the stored value is produced by amx intrinsics.

Sure, you can get rid of unnecessary conversions/loads/stores during optimizations, if you can operate on AMX tiles directly and do not need to store the intermediate results in flat vectors. You can also use strided loads/stores on contiguous memory (no gap between columns/rows). But whatever optimizations are applied later does not affect whether the IR emitted by Clang is correct. It always needs to be correct. Whether further optimizations are correct is a different problem, but we need a specification for the builtins, intrinsics and the type before going any further in that direction.

void foo() {
  tilea = __builtin_ia32_tileloadd64_internal(16, 64, buf, 64);
}

is lowered to

define dso_local void @foo() #0 {
  %1 = call x86_amx @llvm.x86.tileloadd64.internal(i16 16, i16 64, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 64)
  %2 = bitcast x86_amx %1 to <256 x i32>
  store <256 x i32> %2, <256 x i32>* @tilea, align 64
  ret void
}

So we emit an intrinsic to do the strided load and the result is stored to continuous memory, which is what the type _tile1024i requires. What's not modeled correctly is the conversion between the result of @llvm.x86.tileloadd64.internal and the store. It needs to be transferred in a flat vector.

Yes. I agree that it needs to be transferred in a flat vector.

Whether we should have x86_amx in the first place is a separate question I think. Having a builtin type that does not work properly with fundamental instructions like load or store seems prone for errors (what instructions actually work with x86_amx? Do binary operators work?). Perhaps it would be possible and sufficient to have the intrinsics use an opaque type instead of a builtin type, like

We only support tileload, tilestore, tilezero, tiletdp (dot product) instructions/intrinsics for x86_amx. Is there any example LLVM source code of an opaque type used with builtins? This example has some errors at https://gcc.godbolt.org/z/ar6WhjTMz.

I think you need to set the input to LLVM IR: https://gcc.godbolt.org/z/WexMjsas9

You should be able to use opaque types with overloaded intrinsics. I don't think you can define an intrinsic to take a specific opaque type (because it's not known up front).

Whether further optimizations are correct is a different problem, but we need a specification for the builtins, intrinsics and the type before going any further in that direction.

I think you need to set the input to LLVM IR: https://gcc.godbolt.org/z/WexMjsas9

You should be able to use opaque types with overloaded intrinsics. I don't think you define an intrinsic to take a specific opaque type (because it's not known up front).

The opaque type (https://llvm.org/docs/LangRef.html#opaque-structure-types) is pretty new to me. I didn't find any example of an opaque type in builtins or intrinsics. I would appreciate it if you could write some example code (maybe tilezero) for the builtins, intrinsics and the type, so that I can understand it well. If we use <256 x i32> in builtins, use x86_amx in intrinsics, and have a specific intrinsic to convert x86_amx to a flat vector <256 x i32>, I am able to do it myself.

fhahn added a comment.Mar 31 2021, 1:20 AM

Whether to further optimizations are correct is a different problem, but we need a specification for the builtins, intrinsics and the type before going any further in that direction.

I think you need to set the input to LLVM IR: https://gcc.godbolt.org/z/WexMjsas9

You should be able to use opaque types with overloaded intrinsics. I don't think you define an intrinsic to take a specific opaque type (because it's not known up front).

The opaque type (https://llvm.org/docs/LangRef.html#opaque-structure-types) is pretty new to me. I didn't find any example of an opaque type in builtins or intrinsics. I would appreciate it if you could write some example code (maybe tilezero) for the builtins, intrinsics and the type, so that I can understand it well. If we use <256 x i32> in builtins, use x86_amx in intrinsics, and have a specific intrinsic to convert x86_amx to a flat vector <256 x i32>, I am able to do it myself.

Unfortunately it is not possible to use an opaque type with the AMX intrinsics at the moment, because of the way they are defined. It is possible to use opaque types with intrinsics in general though, e.g. see https://llvm.godbolt.org/z/Ezhf6535c

My point is, you should be able to adjust the definitions of the AMX intrinsics and then just replace all occurrences of x86_amx in your examples with an opaque type you define in the module. But as I said initially, you don't need to do everything at once (and you probably shouldn't). I'd start with addressing the bitcast issue and tackle the x86_amx type itself once that is done.

(And I am also not saying that it definitely needs to be removed, only that if it should be kept in the long run, it would be good to specify it in the LangRef, and it should have a good justification, especially if there are no instructions that do anything meaningful with values of the type other than take them as arguments and return values. Opaque types are a suggestion for an alternative that *may* be viable without a dedicated first-class type.)

Unfortunately it is not possible to use an opaque type with the AMX intrinsics at the moment, because of the way they are defined. It is possible to use opaque types with intrinsics in general though, e.g. see https://llvm.godbolt.org/z/Ezhf6535c

My point is, you should be able to adjust the definitions of the AMX intrinsics and then just replace all occurrences of x86_amx in your examples with a opaque type you define in the module. But as I said initially, you don't need to do everything at once (and you probably shouldn't). I'd start with addressing the bitcast issue and tackle the x86_amx type itself once that is done.

(And I am also not saying that it definitely needs to be removed, only that if it should be kept in the long run, it would be good to specify it in the LangRef and should have a good justification, especially if there are no instructions that do anything meaningful with values of the type other than take it as arguments and return values. Opaque types are a suggestion for an alternative that *may* be viable without a dedicated first-class type)

Thank you for the suggestion. So here is my plan.

  1. Specify x86_amx in the LangRef.
  2. Add an llvm.x86.tile.cast intrinsic.
  3. Optimize llvm.x86.tile.cast in some of the cases where bitcast is optimized today, and transform llvm.x86.tile.cast to amx intrinsics when it can't be eliminated.
  4. After the above 3 items are finished, replace bitcast with llvm.x86.tile.cast in the front-end when generating IR for amx builtins.
  5. After some time for stabilization, remove the bitcast transform code from LLVM.
  6. After all of the llvm.x86.tile.cast work is finished, let's discuss the opaque type.

Does that look good to you?
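As a sketch of steps 2 and 4 above, the front-end would emit a conversion intrinsic where it emits a bitcast today. The spelling below reuses the cast intrinsic name from this patch's summary; the final name and type mangling may differ:

```llvm
; today: the front-end models the conversion as a bitcast
%v = bitcast x86_amx %tile to <256 x i32>

; planned: an explicit conversion intrinsic instead
%v2 = call <256 x i32> @llvm.x86.vector.amx.cast.v256i32.x86amx(x86_amx %tile)
```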

fhahn added a comment.Mar 31 2021, 1:56 PM

Unfortunately this is not possible to use an opaque type with the AMX intrinsics at the moment, because of the way they are define. It is possible to use opaque types with intrinsics in general though, e.g. see https://llvm.godbolt.org/z/Ezhf6535c

My point is, you should be able to adjust the definitions of the AMX intrinsics and then just replace all occurrences of x86_amx in your examples with a opaque type you define in the module. But as I said initially, you don't need to do everything at once (and you probably shouldn't). I'd start with addressing the bitcast issue and tackle the x86_amx type itself once that is done.

(And I am also not saying that it definitely needs to be removed, only that if it should be kept in the long run, it would be good to specify it in the LangRef and should have a good justification, especially if there are no instructions that do anything meaningful with values of the type other than take it as arguments and return values. Opaque types are a suggestion for an alternative that *may* be viable without a dedicated first-class type)

Thank you for the suggestion. So here is my plan.

  1. Specify x86_amx in the LangRef.
  2. Add an llvm.x86.tile.cast intrinsic.
  3. Optimize llvm.x86.tile.cast in some of the cases where bitcast is optimized today, and transform llvm.x86.tile.cast to amx intrinsics when it can't be eliminated.
  4. After the above 3 items are finished, replace bitcast with llvm.x86.tile.cast in the front-end when generating IR for amx builtins.
  5. After some time for stabilization, remove the bitcast transform code from LLVM.
  6. After all of the llvm.x86.tile.cast work is finished, let's discuss the opaque type.

Does that look good to you?

Sounds good to me, but it might be good to also share this in the thread on llvm-dev, as there might be additional feedback :)