This is an archive of the discontinued LLVM Phabricator instance.

[X86] Add x86_amx type for intel AMX.
ClosedPublic

Authored by LuoYuanke on Nov 21 2020, 4:51 PM.

Details

Summary

The x86_amx type is used for AMX intrinsics. <256 x i32> is bitcast to x86_amx when it is used by AMX intrinsics, and x86_amx is bitcast to <256 x i32> when it is used by load/store instructions, so AMX intrinsics only operate on the x86_amx type. The new x86_amx type helps separate AMX intrinsics from LLVM IR instructions (+-*/). Thanks to Craig for the idea. This patch depends on https://reviews.llvm.org/D87981.
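
A minimal sketch of that convention, modeled on the IR dump later in this review (the function name, buffer, and the fixed stride of 64 are illustrative assumptions):

define void @copy_tile_to_vec(i16 %row, i16 %col, i8* %buf, <256 x i32>* %dst) {
entry:
  ; the AMX intrinsic produces an x86_amx value ...
  %t = call x86_amx @llvm.x86.tileloadd64.internal(i16 %row, i16 %col, i8* %buf, i64 64)
  ; ... and a bitcast is the only bridge to the <256 x i32> world of plain load/store
  %v = bitcast x86_amx %t to <256 x i32>
  store <256 x i32> %v, <256 x i32>* %dst, align 64
  ret void
}

declare x86_amx @llvm.x86.tileloadd64.internal(i16, i16, i8*, i64)

Regular IR arithmetic never sees x86_amx; it stays on the vector side of the bitcast.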

Diff Detail

Event Timeline

LuoYuanke created this revision.Nov 21 2020, 4:51 PM
Herald added projects: Restricted Project, Restricted Project.
LuoYuanke requested review of this revision.Nov 21 2020, 4:51 PM
LuoYuanke retitled this revision from [X86] Add x86_amx type for intel AMX. The x86_amx is used for AMX intrisics. <256 x i32> is bitcast to x86_amx when it is used by AMX intrinsics, and x86_amx is bitcast to <256 x i32> when it is used by load/store instruction. So amx intrinsics... to [X86] Add x86_amx type for intel AMX. .Nov 21 2020, 4:54 PM
LuoYuanke edited the summary of this revision.

I only took a quick pass through this so far. What happens if a bitcast between x86amx and v256i32(or any other 1024-bit vector type) exists in the IR but isn't next to a load/store?

llvm/lib/Target/X86/X86ISelLowering.cpp
5348–5349

Should this just be deleted?

llvm/lib/Target/X86/X86LowerAMXType.cpp
244

Don't use an assert to check the result of a dyn_cast. If it shouldn't fail, just use cast<LoadInst>, which will assert internally.

252

Unchecked dyn_cast

275

Use cast.

I only took a quick pass through this so far. What happens if a bitcast between x86amx and v256i32(or any other 1024-bit vector type) exists in the IR but isn't next to a load/store?

@craig.topper , thank you for reviewing my patch.
I think if the user just uses our external API, such IR won't be generated. However, if such IR does appear, we can transform the bitcast into a <store, load> pair so that the type is translated through memory. One of the <store, load> pair is an AMX intrinsic store/load, so it won't be optimized away. Is that reasonable?
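
For illustration, a sketch of that rewrite, assuming the shape (%row, %col) and a stride of 64 can be recovered from the AMX intrinsic that consumes the tile (all names here are hypothetical). A bare "%amx = bitcast <256 x i32> %vec to x86_amx" would become:

  ; stack slot introduced for the round trip through memory
  %slot = alloca <256 x i32>, align 64
  store <256 x i32> %vec, <256 x i32>* %slot, align 64
  %p = bitcast <256 x i32>* %slot to i8*
  ; the load half is an AMX intrinsic, so later passes won't fold it back into a bitcast
  %amx = call x86_amx @llvm.x86.tileloadd64.internal(i16 %row, i16 %col, i8* %p, i64 64)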

I only took a quick pass through this so far. What happens if a bitcast between x86amx and v256i32(or any other 1024-bit vector type) exists in the IR but isn't next to a load/store?

@craig.topper , thank you for reviewing my patch.
I think if the user just uses our external API, such IR won't be generated. However, if such IR does appear, we can transform the bitcast into a <store, load> pair so that the type is translated through memory. One of the <store, load> pair is an AMX intrinsic store/load, so it won't be optimized away. Is that reasonable?

It's fine if it's not optimized, just make sure it doesn't crash.

LuoYuanke updated this revision to Diff 307298.Nov 24 2020, 3:08 AM
LuoYuanke retitled this revision from [X86] Add x86_amx type for intel AMX. to [X86] Add x86_amx type for intel AMX..

Address Craig's comments.

LuoYuanke updated this revision to Diff 307300.Nov 24 2020, 3:16 AM

Address Craig's comments.
Change dyn_cast to cast.

LuoYuanke marked 3 inline comments as done.Nov 24 2020, 3:19 AM
pengfei added inline comments.Nov 24 2020, 4:30 AM
llvm/lib/IR/DataLayout.cpp
796

Should be 512 bits?

llvm/lib/Target/X86/X86LowerAMXType.cpp
171

Why shouldn't the alignment be 64?

pengfei added inline comments.Nov 24 2020, 4:43 AM
llvm/lib/Target/X86/X86LowerAMXType.cpp
279

%src is not used here.

llvm/utils/TableGen/IntrinsicEmitter.cpp
252

Remove ,

craig.topper added inline comments.Nov 24 2020, 10:45 AM
llvm/lib/Target/X86/X86LowerAMXType.cpp
164–204

Shouldn't this be in the function's entry block?

181

Just use Value. auto doesn't add any value other than shortening by 1 character.

293

maximun->maximum

297

Use Builder.getInt8PtrTy(); then you don't need Ctx.

LuoYuanke marked an inline comment as done.Nov 24 2020, 9:39 PM
LuoYuanke added inline comments.
llvm/lib/IR/DataLayout.cpp
796

Yes. It is 512. Thanks.

llvm/lib/Target/X86/X86LowerAMXType.cpp
164–204

Yes, it is in the function's entry block. That is done at line 48 of CreateAllocaInst(). CreateAllocaInst() is actually copied from your code. :)

171

1024 is conservative, because vectors require the alignment to be the vector size. Here we generate <256 x i32> vector load/store instructions.

LuoYuanke updated this revision to Diff 307514.Nov 24 2020, 9:46 PM
LuoYuanke marked an inline comment as done.

Address Craig and Pengfei's comments.

LuoYuanke updated this revision to Diff 308146.Nov 28 2020, 3:32 AM

Add handling for "bitcast <256 x i32>* to x86_amx*".
Refactor the code.

LuoYuanke updated this revision to Diff 309761.Dec 5 2020, 11:31 PM

Avoid generating constants for x86_amx.

pengfei added inline comments.Dec 21 2020, 5:46 AM
llvm/lib/IR/ConstantFold.cpp
540

The operator should be at the end of the line.

llvm/lib/Target/X86/X86LowerAMXType.cpp
58

I think we'd better check for unexpected intrinsics, e.g.:

default:
  llvm_unreachable("Unexpected AMX intrinsic!");
case Intrinsic::x86_tileloadd64_internal:
case Intrinsic::x86_tdpbssd_internal:
case Intrinsic::x86_tilestored64_internal:
  Row = II->getArgOperand(0);
  Col = II->getArgOperand(1);
  break;
120

Why not check for empty like on line 157?

202

Is it possible that the x86_amx operand isn't from an AMX intrinsic? E.g.:

%src = bitcast <256 x i32> %xxx to x86_amx
%2 = bitcast x86_amx %src to <256 x i32>
219–222

Better to move it to line 310.

234–235

Better to reuse the cast result, e.g.

BitCastInst *BInst = dyn_cast<BitCastInst>(&Inst);
if (!BInst)

You can then save several cast<BitCastInst>(&Inst) calls below.

261

Where does x86_amx* %tile come from? Shouldn't it have been transformed to x86_amx before this bitcast, if it exists?

267

Maybe it's better to keep a duplicated load rather than calling transformBitcast. The same for line 285.

282

Why do we need to consider the case where the <256 x i32> has more than one use?

llvm/test/CodeGen/X86/AMX/amx-across-func.ll
89–91

Better to remove these unused attributes. The same for the other tests.

llvm/test/CodeGen/X86/AMX/amx-type.ll
67

For this and the next test, we have a chance to optimize to memcpy if we can make sure %s is the constant 64.

102

We don't need to check this case now, right?

LuoYuanke updated this revision to Diff 313315.Dec 22 2020, 5:53 AM

Address Pengfei's comments.

LuoYuanke updated this revision to Diff 313326.Dec 22 2020, 6:47 AM

Rebase and fix lit test case failure.

pengfei added inline comments.Dec 22 2020, 6:57 AM
llvm/lib/Target/X86/X86LowerAMXType.cpp
151

Why not put it in DeadBitcasts?

153

Can we leave the bitcast canonicalization cases to a separate patch? It's a bit complex here, and I don't think it's a common case.

161

Maybe better to use BitCastInst?

307

Is this comment for the code above? Better to move it up.

LuoYuanke added inline comments.Dec 22 2020, 3:20 PM
llvm/lib/Target/X86/X86LowerAMXType.cpp
153

Ok, I'll create another patch for it.

161

There may be dead load or store instructions.

202

Good catch. I'll add support for this pattern.
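
A possible handling, for illustration only (the updated diff may do something different): when the second bitcast simply undoes the first, no memory round trip is needed.

  %src = bitcast <256 x i32> %xxx to x86_amx
  %2 = bitcast x86_amx %src to <256 x i32>
  ; %2 is just %xxx again, so its uses can be rewritten to use %xxx directly;
  ; any remaining x86_amx use of %src can still be routed through a stack slot
  ; as sketched earlier in this review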

234–235

That's good. Thanks.

llvm/test/CodeGen/X86/AMX/amx-across-func.ll
89–91

I'll create a separate patch to clean the attributes.

llvm/test/CodeGen/X86/AMX/amx-type.ll
67

If the stride is 64, we can transform the code to memcpy. How about doing it in another patch?

102

It checks that load and store instructions are not transformed when they do not participate in an AMX operation. I prefer to keep the case.

LuoYuanke updated this revision to Diff 313458.Dec 22 2020, 5:44 PM

Address Pengfei's comments.

LuoYuanke added inline comments.Dec 22 2020, 10:04 PM
llvm/lib/Target/X86/X86LowerAMXType.cpp
261

In my test case, it is transformed after Combine redundant instructions.

*** IR Dump After Simplify the CFG ***
define internal fastcc void @_ZL12__tile_loaddP15__tile1024i_strPKvm(%struct.__tile1024i_str* nocapture %dst) unnamed_addr #4 {
entry:
  %row = getelementptr inbounds %struct.__tile1024i_str, %struct.__tile1024i_str* %dst, i64 0, i32 0
  %0 = load i16, i16* %row, align 64, !tbaa !2
  %col = getelementptr inbounds %struct.__tile1024i_str, %struct.__tile1024i_str* %dst, i64 0, i32 1
  %1 = load i16, i16* %col, align 2, !tbaa !7
  %2 = call x86_amx @llvm.x86.tileloadd64.internal(i16 %0, i16 %1, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 64) #6
  %3 = bitcast x86_amx %2 to <256 x i32>
  %tile = getelementptr inbounds %struct.__tile1024i_str, %struct.__tile1024i_str* %dst, i64 0, i32 3
  store <256 x i32> %3, <256 x i32>* %tile, align 64, !tbaa !8
  ret void
}

To

*** IR Dump After Combine redundant instructions ***
; Function Attrs: alwaysinline nounwind uwtable mustprogress
define internal fastcc void @_ZL12__tile_loaddP15__tile1024i_strPKvm(%struct.__tile1024i_str* nocapture %dst) unnamed_addr #4 {
entry:
  %row = getelementptr inbounds %struct.__tile1024i_str, %struct.__tile1024i_str* %dst, i64 0, i32 0
  %0 = load i16, i16* %row, align 64, !tbaa !2
  %col = getelementptr inbounds %struct.__tile1024i_str, %struct.__tile1024i_str* %dst, i64 0, i32 1
  %1 = load i16, i16* %col, align 2, !tbaa !7
  %2 = call x86_amx @llvm.x86.tileloadd64.internal(i16 %0, i16 %1, i8* getelementptr inbounds ([1024 x i8], [1024 x i8]* @buf, i64 0, i64 0), i64 64) #6
  %tile = getelementptr inbounds %struct.__tile1024i_str, %struct.__tile1024i_str* %dst, i64 0, i32 3
  %3 = bitcast <256 x i32>* %tile to x86_amx*
  store x86_amx %2, x86_amx* %3, align 64, !tbaa !8
  ret void
}

In my test case, it is transformed after Combine redundant instructions.

Can we disable it for the AMX type? A pointer to the AMX type is meaningless and may result in bad performance.

llvm/lib/Target/X86/X86LowerAMXType.cpp
151

I don't see any chance of this happening, but we still need to handle the x86_amx* here if possible, right?
Maybe better to add an assertion for now.

cast<PointerType>(Src->getType())->getElementType()->isX86_AMXTy()

In my test case, it is transformed after Combine redundant instructions.

Can we disable it for the AMX type? A pointer to the AMX type is meaningless and may result in bad performance.

OK, I'll disable the transform for the AMX type.

LuoYuanke updated this revision to Diff 313635.Dec 23 2020, 5:53 PM

Address Pengfei's comments.

LuoYuanke updated this revision to Diff 313639.Dec 23 2020, 6:19 PM

Improve the comments.

In my test case, it is transformed after Combine redundant instructions.

Can we disable it for the AMX type? A pointer to the AMX type is meaningless and may result in bad performance.

OK, I'll disable the transform for the AMX type.

Good job.

llvm/lib/Target/X86/X86LowerAMXType.cpp
47

Currently, we don't have a HW type for v256i32. I think 64 bytes (512 bits) should be enough here.

122–148

Value

124

What if the tile comes from tdpbssd?

171

We don't need to align to 1024; 64 should be enough. The same for the comments below.

195

What if the tile comes from tdpbssd?

LuoYuanke added inline comments.Dec 23 2020, 11:19 PM
llvm/lib/Target/X86/X86LowerAMXType.cpp
124

We have a convention: when an AMX intrinsic defines an x86_amx tile, the first 2 operands are the shape of the defined tile. For tdpbssd, the intrinsic operands are (m, n, k, ...), and (m, n) is the shape of the produced tile.
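
A concrete call following that convention (the exact internal signature of tdpbssd is an assumption here; only the (m, n, k, ...) operand order is stated above):

  ; the first two operands (m, n) give the shape of the produced tile,
  ; so Row = getArgOperand(0) and Col = getArgOperand(1), as for tileloadd64/tilestored64
  %d = call x86_amx @llvm.x86.tdpbssd.internal(i16 %m, i16 %n, i16 %k, x86_amx %acc, x86_amx %a, x86_amx %b)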

pengfei added inline comments.Dec 23 2020, 11:35 PM
llvm/lib/Target/X86/X86LowerAMXType.cpp
124

Oh, yes. I missed that. Thanks.

242

vector

311

Why do we need to recursively delete them? I think deleting the nodes in DeadInsts is enough.

Address Pengfei's comments.

Refine comments.

LuoYuanke marked 2 inline comments as done.Dec 24 2020, 12:30 AM
pengfei accepted this revision.Dec 24 2020, 12:34 AM

LGTM. Thanks for the refactors. Maybe better to wait for a few days to see if others have objections.

This revision is now accepted and ready to land.Dec 24 2020, 12:34 AM

LGTM. Thanks for the refactors. Maybe better to wait for a few days to see if others have objections.

Thanks, Pengfei, for the review. Sure, I'll wait for a few days.

This revision was landed with ongoing or failed builds.Dec 29 2020, 9:52 PM
This revision was automatically updated to reflect the committed changes.
uabelho added inline comments.
llvm/include/llvm/IR/Type.h
68

This addition causes a compilation warning in HexagonTargetObjectFile.cpp:

../lib/Target/Hexagon/HexagonTargetObjectFile.cpp:297:11: error: enumeration value 'X86_AMXTyID' not handled in switch [-Werror,-Wswitch]
  switch (Ty->getTypeID()) {
          ^
1 error generated.

Seen in build bots, e.g. here:
http://lab.llvm.org:8011/#/builders/57/builds/2889/steps/6/logs/stdio

pengfei added inline comments.Dec 30 2020, 6:35 AM
llvm/include/llvm/IR/Type.h
68

Thanks, Mikael, for pointing it out. I think we just need to add the type to the switch. I've posted a patch to fix it: rG16c2067cf212.

uabelho added inline comments.Dec 30 2020, 6:48 AM
llvm/include/llvm/IR/Type.h
68

Yep, thanks!

MaskRay added a subscriber: MaskRay.EditedDec 30 2020, 9:51 AM

D93944 fixed an llvm-c-test issue. Note: adding new enum members usually requires check-all (at least check-llvm, but Clang may use these enums as well) because they can be used everywhere.

cuviper added inline comments.
llvm/include/llvm-c/Core.h
163

This is a breaking change to the C ABI -- can we move it to the end of the enum?
https://bugs.llvm.org/show_bug.cgi?id=48905

MaskRay added inline comments.Jan 27 2021, 4:32 PM
llvm/include/llvm-c/Core.h
163
LuoYuanke added inline comments.Jan 27 2021, 5:20 PM
llvm/include/llvm-c/Core.h
163

@MaskRay, thank you!

ychen added a subscriber: ychen.Mar 17 2021, 9:12 PM