The x86_amx is used for AMX intrinsics. <256 x i32> is bitcasted to x86_amx when it is used by AMX intrinsics, and x86_amx is bitcasted to <256 x i32> when it is used by load/store instruction. So amx intrinsics only operate on type x86_amx. The new type x86_amx can help to separate amx intrinsics from llvm IR instructions (+-*/). Thank Craig for the idea. This patch depends on https://reviews.llvm.org/D87981.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Unit Tests
| Time | Test | |
|---|---|---|
| 2,530 ms | x64 debian > libarcher.races::task-two.c | 
Event Timeline
I only took a quick pass through this so far. What happens if a bitcast between x86amx and v256i32(or any other 1024-bit vector type) exists in the IR but isn't next to a load/store?
| llvm/lib/Target/X86/X86ISelLowering.cpp | ||
|---|---|---|
| 5348–5349 | Should this just be deleted? | |
| llvm/lib/Target/X86/X86LowerAMXType.cpp | ||
| 246 | Don't use an assert to check the result of a dyn_cast. If it shouldn't fail just use cast<LoadInst> which will assert internally. | |
| 254 | Unchecked dyn_cast | |
| 277 | Use cast. | |
@craig.topper , thank you for reviewing my patch.
I think if user just use our external API, such IR won't be generated. However if there is such IR, we can transform bitcast to <store, load>, so that the type can be translated through memory. One of <store, load> is AMX intrinsics store/load, so it won't be optimized. Is it reasonable?
| llvm/lib/IR/DataLayout.cpp | ||
|---|---|---|
| 796 | Yes. It is 512. Thanks. | |
| llvm/lib/Target/X86/X86LowerAMXType.cpp | ||
| 166–206 | Yes. It is in function's entry block. It is done in line 48 of function CreateAllocaInst(). CreateAllocaInst() is actually copied from your code. :) | |
| 173 | 1024 is conservatives, because vector require the alignment to be the vector size. Here generate vector <256 x i32> load/store. | |
| llvm/lib/IR/ConstantFold.cpp | ||
|---|---|---|
| 540 | Operation should at the end of the line. | |
| llvm/lib/Target/X86/X86LowerAMXType.cpp | ||
| 60 | I think we'd better to check exceptions. E.g. default:
  llvm_unreachable("");
case Intrinsic::x86_tileloadd64_internal:
case Intrinsic::x86_tdpbssd_internal:
case Intrinsic::x86_tilestored64_internal:
  Row = II->getArgOperand(0);
  Col = II->getArgOperand(1);
  break; | |
| 122 | Why don't check empty like line 157? | |
| 204 | Is it possible the x86_amx operand isn't from AMX intrinsic, e.g. %src = bitcast <256 x i32> %xxx to x86_amx %2 = bitcast x86_amx %src to <256 x i32> | |
| 221–224 | Better move it to line 310. | |
| 236–237 | Better to reuse the cast result, e.g. BitCastInst *BInst = dyn_cast<BitCastInst>(&Inst); if (!BInst ) You can save several cast<BitCastInst>(&Inst) below. | |
| 263 | Where's x86_amx* %tile from? Shouldn't been transfered to x86_amx before this bitcast if it exists? | |
| 269 | Maybe better to keep a duplicated load that calling transformBitcast. The same for line 285. | |
| 284 | Why we need to consider <256 x i32> has more than one use? | |
| llvm/test/CodeGen/X86/AMX/amx-across-func.ll | ||
| 89–91 | Better to remove these unused attributes. The same to other tests. | |
| llvm/test/CodeGen/X86/AMX/amx-type.ll | ||
| 67 | For this and the next test, we have chances to optimize to memcpy if we can make sure %s is constant 64. | |
| 103 | We don't need to check this case now, right? | |
| llvm/lib/Target/X86/X86LowerAMXType.cpp | ||
|---|---|---|
| 155 | Ok, I'll create another patch for it. | |
| 163 | There may be dead load or store instructions. | |
| 204 | Good catch. I'll add support for this pattern. | |
| 236–237 | That's good. Thanks. | |
| llvm/test/CodeGen/X86/AMX/amx-across-func.ll | ||
| 89–91 | I'll create a separate patch to clean the attributes. | |
| llvm/test/CodeGen/X86/AMX/amx-type.ll | ||
| 67 | If the stride is 64 we can transform the code to memcpy. How about do it in another patch? | |
| 103 | It can check the load and store instruction is not transformed if they are not participate in amx operation. I prefer to keep the case. | |
| llvm/lib/Target/X86/X86LowerAMXType.cpp | ||
|---|---|---|
| 263 | In my test case, it is transformed after Combine redundant instructions. *** IR Dump After Simplify the CFG ***
define internal fastcc void @_ZL12__tile_loaddP15__tile1024i_strPKvm(%struct.__tile1024i_str* nocapture %dst) unnamed_addr #4 {
entry:
  %row = getelementptr inbounds %struct.__tile1024i_str, %struct.__tile1024i_str* %dst, i64 0, i32 0
  %0 = load i16, i16* %row, align 64, !tbaa !2
  %col = getelementptr inbounds %struct.__tile1024i_str, %struct.__tile1024i_str* %dst, i64 0, i32 1
  %1 = load i16, i16* %col, align 2, !tbaa !7
  %2 = call x86_amx @llvm.x86.tileloadd64.internal(i16 %0, i16 %1, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 64) #6
  %3 = bitcast x86_amx %2 to <256 x i32>
  %tile = getelementptr inbounds %struct.__tile1024i_str, %struct.__tile1024i_str* %dst, i64 0, i32 3
  store <256 x i32> %3, <256 x i32>* %tile, align 64, !tbaa !8
  ret void
}To *** IR Dump After Combine redundant instructions ***
; Function Attrs: alwaysinline nounwind uwtable mustprogress
define internal fastcc void @_ZL12__tile_loaddP15__tile1024i_strPKvm(%struct.__tile1024i_str* nocapture %dst) unnamed_addr #4 {
entry:
  %row = getelementptr inbounds %struct.__tile1024i_str, %struct.__tile1024i_str* %dst, i64 0, i32 0
  %0 = load i16, i16* %row, align 64, !tbaa !2
  %col = getelementptr inbounds %struct.__tile1024i_str, %struct.__tile1024i_str* %dst, i64 0, i32 1
  %1 = load i16, i16* %col, align 2, !tbaa !7
  %2 = call x86_amx @llvm.x86.tileloadd64.internal(i16 %0, i16 %1, i8* getelementptr inbounds ([1024 x i8], [1024
x i8]* @buf, i64 0, i64 0), i64 64) #6
  %tile = getelementptr inbounds %struct.__tile1024i_str, %struct.__tile1024i_str* %dst, i64 0, i32 3
  %3 = bitcast <256 x i32>* %tile to x86_amx*
  store x86_amx %2, x86_amx* %3, align 64, !tbaa !8
  ret void
} | |
In my test case, it is transformed after Combine redundant instructions.
Can we disable it for AMX type? The pointer to AMX type is meaningless and may result in bad perfomance.
| llvm/lib/Target/X86/X86LowerAMXType.cpp | ||
|---|---|---|
| 153 | I don't see any chance this happen. But we still need to handle the x86_amx* here if possible, right? cast<PointerType>(Src->getType())->isX86_AMXTy() | |
Good job.
| llvm/lib/Target/X86/X86LowerAMXType.cpp | ||
|---|---|---|
| 50 | Currently, we don't have HW type for v256i32. I think 64 bytes(512bits) should be enough here. | |
| 124–150 | Value | |
| 126 | How about the Tile comes from tdpbssd? | |
| 173 | We don't need to align to 1024. 64 should be enough. The same for below comments. | |
| 197 | How about the Tile comes from tdpbssd? | |
| llvm/lib/Target/X86/X86LowerAMXType.cpp | ||
|---|---|---|
| 126 | We have a convention, when amx intrinsics define a x86_amx tile the first 2 operands is the shape of the defined tile. For tdpbssd, the intrinsics operands are (m, n, k, ...). (m, n) is the shape of the produced tile. | |
LGTM. Thanks for the refactors. Maybe better to wait for a few days to see if others have objections.
| llvm/include/llvm/IR/Type.h | ||
|---|---|---|
| 68 | This addition causes a compilation warning in HexagonTargetObjectFile.cpp: ../lib/Target/Hexagon/HexagonTargetObjectFile.cpp:297:11: error: enumeration value 'X86_AMXTyID' not handled in switch [-Werror,-Wswitch]
  switch (Ty->getTypeID()) {
          ^
1 error generated.Seen in build bots, e.g. here: | |
| llvm/include/llvm/IR/Type.h | ||
|---|---|---|
| 68 | Thanks Mikael for pointing it out. I think we just need to put the type in the switch table. I've posted a patch to fix it. rG16c2067cf212. | |
| llvm/include/llvm/IR/Type.h | ||
|---|---|---|
| 68 | Yep, thanks! | |
D93944 fixed an llvm-c-test issue. Note, adding new enum members usually requires check-all (at least check-llvm, but Clang may use these enum as well) because they can be used everywhere.
| llvm/include/llvm-c/Core.h | ||
|---|---|---|
| 163 | This is a breaking change to the C ABI -- can we move it to the end of the enum? | |
| llvm/include/llvm-c/Core.h | ||
|---|---|---|
| 163 | ||
This is a breaking change to the C ABI -- can we move it to the end of the enum?
https://bugs.llvm.org/show_bug.cgi?id=48905