The x86_amx type is used for AMX intrinsics. A <256 x i32> value is bitcast to x86_amx when it is passed to an AMX intrinsic, and x86_amx is bitcast back to <256 x i32> when the value is used by load/store instructions, so the AMX intrinsics only ever operate on the x86_amx type. The new type helps separate the AMX intrinsics from ordinary LLVM IR instructions (+-*/). Thanks to Craig for the idea. This patch depends on https://reviews.llvm.org/D87981.
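To make the intended convention concrete, here is a minimal IR sketch (the function name, value names, and shapes are illustrative, not taken from the patch): ordinary IR sees the tile as a <256 x i32> value, the AMX intrinsic only accepts x86_amx, and the bitcast at the boundary is what the lowering pass rewrites.

    define void @store_tile(<256 x i32>* %ptr, i16 %row, i16 %col, i8* %buf, i64 %stride) {
    entry:
      ; plain IR side: the tile is just a 1024-byte vector
      %vec = load <256 x i32>, <256 x i32>* %ptr, align 64
      ; intrinsic side: the same value must be x86_amx
      %amx = bitcast <256 x i32> %vec to x86_amx
      call void @llvm.x86.tilestored64.internal(i16 %row, i16 %col, i8* %buf, i64 %stride, x86_amx %amx)
      ret void
    }

    declare void @llvm.x86.tilestored64.internal(i16, i16, i8*, i64, x86_amx)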
I only took a quick pass through this so far. What happens if a bitcast between x86_amx and v256i32 (or any other 1024-bit vector type) exists in the IR but isn't next to a load/store?
llvm/lib/Target/X86/X86ISelLowering.cpp
5348–5349: Should this just be deleted?

llvm/lib/Target/X86/X86LowerAMXType.cpp
413: Don't use an assert to check the result of a dyn_cast. If it shouldn't fail, just use cast<LoadInst>, which will assert internally.
421: Unchecked dyn_cast.
444: Use cast.
@craig.topper, thank you for reviewing my patch.
I think such IR won't be generated if users only use our external API. However, if such IR does appear, we can transform the bitcast into a <store, load> pair so that the type is translated through memory. One of the <store, load> pair is an AMX intrinsic store/load, so it won't be optimized away. Is that reasonable?
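A rough sketch of that rewrite (the names and the fixed stride of 64 are assumptions for illustration, not code from the patch): the stray bitcast is replaced by a stack slot plus an AMX load, so the conversion goes through memory and one side of the pair is an AMX intrinsic.

    define void @cast_through_memory(<256 x i32> %vec, i16 %row, i16 %col, i8* %out) {
    entry:
      ; originally: %amx = bitcast <256 x i32> %vec to x86_amx
      %slot = alloca <256 x i32>, align 64
      store <256 x i32> %vec, <256 x i32>* %slot, align 64
      %p = bitcast <256 x i32>* %slot to i8*
      ; the reload is an AMX intrinsic, so the store/load pair is not folded away
      %amx = call x86_amx @llvm.x86.tileloadd64.internal(i16 %row, i16 %col, i8* %p, i64 64)
      call void @llvm.x86.tilestored64.internal(i16 %row, i16 %col, i8* %out, i64 64, x86_amx %amx)
      ret void
    }

    declare x86_amx @llvm.x86.tileloadd64.internal(i16, i16, i8*, i64)
    declare void @llvm.x86.tilestored64.internal(i16, i16, i8*, i64, x86_amx)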
llvm/lib/IR/DataLayout.cpp
796: Yes. It is 512. Thanks.

llvm/lib/Target/X86/X86LowerAMXType.cpp
391–392: Yes, it is in the function's entry block. It is done at line 48 in CreateAllocaInst(), which is actually copied from your code. :)
398: 1024 is conservative, because vectors require the alignment to be the vector size, and here we generate <256 x i32> vector load/store.
llvm/lib/IR/ConstantFold.cpp
540: The operator should be at the end of the line.

llvm/lib/Target/X86/X86LowerAMXType.cpp
176: I think we'd better check for unexpected intrinsics, e.g.

    default:
      llvm_unreachable("");
    case Intrinsic::x86_tileloadd64_internal:
    case Intrinsic::x86_tdpbssd_internal:
    case Intrinsic::x86_tilestored64_internal:
      Row = II->getArgOperand(0);
      Col = II->getArgOperand(1);
      break;

239: Why not check for empty like line 157?
393–394: Better to move it to line 310.
396–403: Better to reuse the cast result, e.g.

    BitCastInst *BInst = dyn_cast<BitCastInst>(&Inst);
    if (!BInst)

That saves several cast<BitCastInst>(&Inst) calls below.
429: Is it possible that the x86_amx operand isn't from an AMX intrinsic? E.g.

    %src = bitcast <256 x i32> %xxx to x86_amx
    %2 = bitcast x86_amx %src to <256 x i32>

430: Where does x86_amx* %tile come from? Shouldn't it have been transformed to x86_amx before this bitcast, if it exists?
436: Maybe better to keep a duplicated load than calling transformBitcast. The same for line 285.
451: Why do we need to consider the case where <256 x i32> has more than one use?
llvm/test/CodeGen/X86/AMX/amx-across-func.ll
89–91: Better to remove these unused attributes. The same for the other tests.

llvm/test/CodeGen/X86/AMX/amx-type.ll
67: For this and the next test, we have a chance to optimize to a memcpy if we can make sure %s is the constant 64.
145: We don't need to check this case now, right?
llvm/lib/Target/X86/X86LowerAMXType.cpp
264: There may be dead load or store instructions.
280: OK, I'll create another patch for it.
396–403: That's good. Thanks.
429: Good catch. I'll add support for this pattern.

llvm/test/CodeGen/X86/AMX/amx-across-func.ll
89–91: I'll create a separate patch to clean up the attributes.

llvm/test/CodeGen/X86/AMX/amx-type.ll
67: If the stride is 64, we can transform the code to a memcpy (see the sketch after this list). How about doing that in another patch?
145: It checks that load and store instructions are not transformed when they do not participate in an AMX operation. I prefer to keep the case.
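For reference, a hypothetical sketch of that follow-up (the function and value names are made up): with a constant stride of 64 the 64-byte rows of a full tile are contiguous, so a <256 x i32> copy could become a plain 1024-byte memcpy instead of a tileloadd/tilestored pair.

    define void @copy_tile(<256 x i32>* %dst, <256 x i32>* %src) {
    entry:
      %d = bitcast <256 x i32>* %dst to i8*
      %s = bitcast <256 x i32>* %src to i8*
      ; 256 x i32 = 1024 bytes; with stride 64 the 64-byte rows are back to back
      call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 64 %d, i8* align 64 %s, i64 1024, i1 false)
      ret void
    }

    declare void @llvm.memcpy.p0i8.p0i8.i64(i8*, i8*, i64, i1)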
llvm/lib/Target/X86/X86LowerAMXType.cpp
430: In my test case, it is transformed after "Combine redundant instructions":

    *** IR Dump After Simplify the CFG ***
    define internal fastcc void @_ZL12__tile_loaddP15__tile1024i_strPKvm(%struct.__tile1024i_str* nocapture %dst) unnamed_addr #4 {
    entry:
      %row = getelementptr inbounds %struct.__tile1024i_str, %struct.__tile1024i_str* %dst, i64 0, i32 0
      %0 = load i16, i16* %row, align 64, !tbaa !2
      %col = getelementptr inbounds %struct.__tile1024i_str, %struct.__tile1024i_str* %dst, i64 0, i32 1
      %1 = load i16, i16* %col, align 2, !tbaa !7
      %2 = call x86_amx @llvm.x86.tileloadd64.internal(i16 %0, i16 %1, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 64) #6
      %3 = bitcast x86_amx %2 to <256 x i32>
      %tile = getelementptr inbounds %struct.__tile1024i_str, %struct.__tile1024i_str* %dst, i64 0, i32 3
      store <256 x i32> %3, <256 x i32>* %tile, align 64, !tbaa !8
      ret void
    }

to

    *** IR Dump After Combine redundant instructions ***
    ; Function Attrs: alwaysinline nounwind uwtable mustprogress
    define internal fastcc void @_ZL12__tile_loaddP15__tile1024i_strPKvm(%struct.__tile1024i_str* nocapture %dst) unnamed_addr #4 {
    entry:
      %row = getelementptr inbounds %struct.__tile1024i_str, %struct.__tile1024i_str* %dst, i64 0, i32 0
      %0 = load i16, i16* %row, align 64, !tbaa !2
      %col = getelementptr inbounds %struct.__tile1024i_str, %struct.__tile1024i_str* %dst, i64 0, i32 1
      %1 = load i16, i16* %col, align 2, !tbaa !7
      %2 = call x86_amx @llvm.x86.tileloadd64.internal(i16 %0, i16 %1, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 64) #6
      %tile = getelementptr inbounds %struct.__tile1024i_str, %struct.__tile1024i_str* %dst, i64 0, i32 3
      %3 = bitcast <256 x i32>* %tile to x86_amx*
      store x86_amx %2, x86_amx* %3, align 64, !tbaa !8
      ret void
    }
> In my test case, it is transformed after Combine redundant instructions.

Can we disable it for the AMX type? A pointer to the AMX type is meaningless and may result in bad performance.
llvm/lib/Target/X86/X86LowerAMXType.cpp
278: I don't see any chance of this happening, but we still need to handle the x86_amx* case here if possible, right? E.g. cast<PointerType>(Src->getType())->isX86_AMXTy().
Good job.
llvm/lib/Target/X86/X86LowerAMXType.cpp
153: Currently we don't have a HW type for v256i32. I think 64 bytes (512 bits) should be enough here.
233–250: Value
235: What if the Tile comes from tdpbssd?
398: We don't need to align to 1024; 64 should be enough. The same for the comments below.
422: What if the Tile comes from tdpbssd?
llvm/lib/Target/X86/X86LowerAMXType.cpp
235: We have a convention: when an AMX intrinsic defines an x86_amx tile, its first two operands are the shape of the defined tile. For tdpbssd, the intrinsic operands are (m, n, k, ...), and (m, n) is the shape of the produced tile. See the fragment below.
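A small IR fragment illustrating that convention (the shape values and tile operands are placeholders, assumed to be defined by earlier AMX intrinsics):

    ; tdpbssd defines an m x n tile, so (m, n) - its first two operands - is the
    ; shape of %d; k is the inner dimension shared with the source tiles
    %d = call x86_amx @llvm.x86.tdpbssd.internal(i16 %m, i16 %n, i16 %k,
                                                 x86_amx %acc, x86_amx %a, x86_amx %b)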
LGTM. Thanks for the refactors. Maybe better to wait for a few days to see if others have objections.
llvm/include/llvm/IR/Type.h
68: This addition causes a compilation warning in HexagonTargetObjectFile.cpp:

    ../lib/Target/Hexagon/HexagonTargetObjectFile.cpp:297:11: error: enumeration value 'X86_AMXTyID' not handled in switch [-Werror,-Wswitch]
      switch (Ty->getTypeID()) {
              ^
    1 error generated.

Seen in build bots, e.g. here:
llvm/include/llvm/IR/Type.h
68: Thanks, Mikael, for pointing it out. I think we just need to add the type to the switch table. I've posted a patch to fix it: rG16c2067cf212.
llvm/include/llvm/IR/Type.h
68: Yep, thanks!
D93944 fixed an llvm-c-test issue. Note that adding new enum members usually requires running check-all (at least check-llvm, but Clang may use these enums as well) because they can be used everywhere.
llvm/include/llvm-c/Core.h
163: This is a breaking change to the C ABI -- can we move it to the end of the enum?
llvm/include/llvm-c/Core.h
163:
> This is a breaking change to the C ABI -- can we move it to the end of the enum?

https://bugs.llvm.org/show_bug.cgi?id=48905