Address Craig's comments.
Sun, Jun 13
Update test case to pass expensive check.
Sat, Jun 12
Fri, Jun 11
I don't know what Intel's original failure looked like, but here's a test that should reproduce this with -run-pass=machinelicm: https://reviews.llvm.org/P8267. It needs more cleanup.
I hacked the MIR just before machinelicm by sinking the CMP64mi32 and SETCCr into the loop. That makes MachineLICM want to unfold it, since the load part is invariant because it comes from a constant global.
Apply Craig's test case. Many thanks to Craig. :)
Thu, Jun 10
Thanks, @lebedev.ri. This is triggered by our internal code; the same test case passes with llvm trunk code. BTW, do you think we need to check that it is an immediate before getting it?
Wed, Jun 9
LGTM. Thank you!
Tue, Jun 8
Thu, Jun 3
We may add a description of the intrinsic to docs/LangRef.rst.
Wed, May 26
Address Pengfei's comments.
Apr 27 2021
Apr 22 2021
Fix some descriptions.
Is there any test case for it?
Apr 20 2021
Apr 19 2021
Address Craig's comments.
Apr 14 2021
Address Roman's comments.
Remove attribute in test case.
Apr 13 2021
Address Pengfei's comments. Add amx description to BitCodeFormat.html.
LGTM. Thank you!
Apr 12 2021
Address Florian and Pengfei's comments.
LGTM. But wait one or two days to see if there are more comments from Craig and HJ.
Apr 11 2021
Apr 8 2021
Apr 7 2021
Apr 6 2021
LGTM. Thank you!
Apr 2 2021
Perhaps we need more comments and more test cases (maybe in a separate file) to cover those scenarios.
Apr 1 2021
A user interrupt is different from a regular interrupt, right? It doesn't make sense that we would change the behavior of the interrupt calling convention just because the user interrupt instructions are enabled. That would occur just from passing a -march for a newer CPU, wouldn't it?
Mar 31 2021
Unfortunately it is not possible to use an opaque type with the AMX intrinsics at the moment, because of the way they are defined. It is possible to use opaque types with intrinsics in general though, e.g. see https://llvm.godbolt.org/z/Ezhf6535c
My point is, you should be able to adjust the definitions of the AMX intrinsics and then just replace all occurrences of x86_amx in your examples with an opaque type you define in the module. But as I said initially, you don't need to do everything at once (and you probably shouldn't). I'd start with addressing the bitcast issue and tackle the x86_amx type itself once that is done.
(And I am also not saying that it definitely needs to be removed, only that if it is to be kept in the long run, it would be good to specify it in the LangRef, and it should have a good justification, especially if there are no instructions that do anything meaningful with values of the type other than take them as arguments and return values. Opaque types are a suggestion for an alternative that *may* be viable without a dedicated first-class type.)
Mar 30 2021
Mar 29 2021
Whether the further optimizations are correct is a different problem, but we need a specification for the builtins, intrinsics and the type before going any further in that direction.
I think you need to set the input to LLVM IR: https://gcc.godbolt.org/z/WexMjsas9
You should be able to use opaque types with overloaded intrinsics. I don't think you can define an intrinsic to take a specific opaque type (because it's not known up front).
I think that point was not really clear during the discussion. Using load <256 x i32> to lower __tile_loadd() would indeed be incorrect. But I don't think that's happening at the moment, at least going from a simple example https://gcc.godbolt.org/z/KT5rczn8j
The load/store <256 x i32> is generated by the front end, because in the C language a tile is a vector <256 x i32>. The load/store <256 x i32> is transformed to llvm.x86.tileloadd64.internal/llvm.x86.tilestored64.internal in lib/Target/X86/X86LowerAMXType.cpp if the load result is to be an operand of AMX intrinsics or the stored value is returned from AMX intrinsics.
Mar 24 2021
I can't see any load <256 x i32> in the linked example, just a store. Could you check the example?
IIUC you need this to transfer/convert data from a consecutive vector to an AMX tile. To express that, emitting an intrinsic for the conversion instead of a bitcast seems the right thing to me.
Yes. We need to transfer/convert data from a consecutive vector to an AMX tile, because in the C language interface the tile is defined as a vector: typedef int _tile1024i __attribute__((__vector_size__(1024), __aligned__(64))); Take the code below (https://gcc.godbolt.org/z/noaWEWd6n) as an example.
Mar 23 2021
To be honest I don't really understand why the x86_amx type is even there.
It seems to me that if you just directly used @llvm.x86.tileloadd64.internal / @llvm.x86.tilestored64.internal,
and s/x86_amx/<256 x i32>/, none of these problems would be here.
A load instruction loads contiguous bytes.
If that is not what AMX is trying to use it for, then it is being used incorrectly.
@lebedev.ri, our goal is to find an ideal solution, not to argue about who is right. I hope there is no bias during the discussion. I hope Florian and James can serve as role models for you. They are trying to understand the problem and helping to solve it. I don't know if it is right to stop someone else's patch based on your own preference.
@lebedev.ri, this patch is mainly for discussing the approach that Florian proposed, so I didn't polish my code. Nevertheless your comments on amx_cast.c are right. __tile_loadd() loads a 2D tile from memory, and it takes an extra stride parameter. As I explained on llvm-dev, it loads each row from memory into the tile register and then does base += stride. So the data is not contiguous in memory.
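To make the row-by-row behavior described above concrete, here is a minimal software model in plain C. It is only a sketch: model_tile_load and the TILE_MAX_* constants are hypothetical names made up for illustration, not part of the AMX headers or of LLVM, and the code does not execute any AMX instruction. It assumes a tile of at most 16 rows of 64 bytes each.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed tile limits for this sketch: 16 rows, 64 bytes per row. */
enum { TILE_MAX_ROWS = 16, TILE_MAX_COLSB = 64 };

/*
 * Hypothetical model of a strided tile load (not a real intrinsic).
 * Each row of `colsb` bytes is copied from `base`, and `base` is then
 * advanced by `stride` bytes, so consecutive rows of the tile come from
 * memory locations that are `stride` bytes apart.
 */
static void model_tile_load(uint8_t tile[TILE_MAX_ROWS][TILE_MAX_COLSB],
                            const uint8_t *base, size_t stride,
                            unsigned rows, unsigned colsb)
{
    assert(rows <= TILE_MAX_ROWS && colsb <= TILE_MAX_COLSB);
    for (unsigned r = 0; r < rows; ++r) {
        memcpy(tile[r], base, colsb); /* load one row */
        base += stride;               /* step to the start of the next row */
    }
}
```

Whenever stride differs from colsb, the bytes that end up in the tile are not one contiguous block of memory, which is why a plain contiguous vector load cannot model this operation in general.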
Mar 21 2021
Mar 20 2021
Fix typo in commit message.