This is an archive of the discontinued LLVM Phabricator instance.

[X86][AMX] combine tile cast and load/store instruction.
ClosedPublic

Authored by LuoYuanke on Apr 25 2022, 4:33 AM.

Details

Summary

The llvm.x86.cast.tile.to.vector intrinsic is lowered to
llvm.x86.tilestored64.internal followed by a load <256 x i32>, and
llvm.x86.cast.vector.to.tile is lowered to a store <256 x i32> followed
by llvm.x86.tileloadd64.internal. When the result of
llvm.x86.cast.tile.to.vector is consumed by a store <256 x i32>, or a
load <256 x i32> is only used by llvm.x86.cast.vector.to.tile, the cast
and the memory access can be combined into a single
llvm.x86.tilestored64.internal or llvm.x86.tileloadd64.internal.
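
A minimal IR sketch of the combine (the names %p, %q, %row, %col and %td
are hypothetical; the shape operands are assumed to come from the AMX
instruction that defines or consumes the tile). The store direction
before the combine:

%vec = call <256 x i32> @llvm.x86.cast.tile.to.vector.v256i32(x86_amx %td)
store <256 x i32> %vec, <256 x i32>* %p, align 64

and after:

%pp = bitcast <256 x i32>* %p to i8*
call void @llvm.x86.tilestored64.internal(i16 %row, i16 %col, i8* %pp, i64 64, x86_amx %td)

Similarly for the load direction, before:

%vec = load <256 x i32>, <256 x i32>* %q, align 64
%tt = call x86_amx @llvm.x86.cast.vector.to.tile.v256i32(<256 x i32> %vec)

and after:

%qq = bitcast <256 x i32>* %q to i8*
%tt = call x86_amx @llvm.x86.tileloadd64.internal(i16 %row, i16 %col, i8* %qq, i64 64)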

Diff Detail

Event Timeline

LuoYuanke created this revision.Apr 25 2022, 4:33 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 25 2022, 4:33 AM
LuoYuanke requested review of this revision.Apr 25 2022, 4:33 AM
LuoYuanke updated this revision to Diff 424886.Apr 25 2022, 5:43 AM

Combine load/store after eliminating the redundant cast.

LuoYuanke updated this revision to Diff 424888.Apr 25 2022, 6:05 AM

Remove unused cast.

Combining usually makes sense. I am not sure about the case where the load/store has special attributes, e.g. volatile. The new AMX load/store may miss them.
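
As an illustration of the concern (hypothetical %p and %td), a volatile access such as

%vec = call <256 x i32> @llvm.x86.cast.tile.to.vector.v256i32(x86_amx %td)
store volatile <256 x i32> %vec, <256 x i32>* %p, align 64

would lose its volatile semantics if the store were rewritten into a plain llvm.x86.tilestored64.internal call.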

llvm/lib/Target/X86/X86LowerAMXType.cpp
947

return

There is still a chance to get the shape from another user here.

994

What if the Cast has other non-load users?

LuoYuanke added inline comments.Apr 25 2022, 11:52 PM
llvm/lib/Target/X86/X86LowerAMXType.cpp
947

Yes, how about adding a TODO for it?

994

In line 985 it has already been checked that the load has only one use. The cast can have more than one user; the compiler replaces all of the cast's users with the tileload instruction.
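
A sketch of that case, with hypothetical names (%q, %row, %col):

%vec = load <256 x i32>, <256 x i32>* %q, align 64            ; single use: the cast
%t = call x86_amx @llvm.x86.cast.vector.to.tile.v256i32(<256 x i32> %vec)
; %t may have several AMX users

After the combine all users of %t are rewritten to use the new tile load instead:

%qq = bitcast <256 x i32>* %q to i8*
%t2 = call x86_amx @llvm.x86.tileloadd64.internal(i16 %row, i16 %col, i8* %qq, i64 64)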

Rebase and fix test case failure.

LuoYuanke added inline comments.Apr 26 2022, 12:46 AM
llvm/lib/Target/X86/X86LowerAMXType.cpp
949

I just realized that the tileload should be inserted before the load instruction instead of before the cast. If there is a store between the load and the cast, this transform is not correct.
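
A sketch of the problematic pattern (hypothetical %q and %other):

%vec = load <256 x i32>, <256 x i32>* %q, align 64
store <256 x i32> %other, <256 x i32>* %q, align 64           ; clobbers the loaded memory
%t = call x86_amx @llvm.x86.cast.vector.to.tile.v256i32(<256 x i32> %vec)

If the replacement llvm.x86.tileloadd64.internal were inserted at the cast, it would read the clobbered value; inserting it right before the original load preserves the semantics.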

Fix bug for cast across a store.

xiangzhangllvm added inline comments.Apr 26 2022, 1:15 AM
llvm/lib/Target/X86/X86LowerAMXType.cpp
947

No problem.

994

Yes, there is no problem if we can make sure the Cast is only used by Load(s).

This revision is now accepted and ready to land.Apr 26 2022, 5:43 PM
This revision was landed with ongoing or failed builds.Apr 28 2022, 12:08 AM
This revision was automatically updated to reflect the committed changes.
yubing added inline comments.Jun 15 2023, 12:21 AM
llvm/lib/Target/X86/X86LowerAMXType.cpp
930

Why is the stride 64 here instead of Col?

LuoYuanke added inline comments.Jun 15 2023, 2:09 AM
llvm/lib/Target/X86/X86LowerAMXType.cpp
930

Both 64 and Col should work as long as the load and store keep the same stride value, but 64 is a constant, so it is preferred.

yubing added inline comments.Jun 15 2023, 2:13 AM
llvm/lib/Target/X86/X86LowerAMXType.cpp
930

How about the following IR:
%tile = call x86_amx @llvm.x86.tileloadd64.internal(i16 8, i16 32, i8* %src_ptr, i64 64)
%vec = call <256 x i8> @llvm.x86.cast.tile.to.vector.v256i8(x86_amx %tile)
store <256 x i8> %vec, <256 x i8>* %dst_ptr, align 256

If you combine it into:
%tile = call x86_amx @llvm.x86.tileloadd64.internal(i16 8, i16 32, i8* %src_ptr, i64 64)
call void @llvm.x86.tilestored64.internal(i16 8, i16 32, i8* %dst_ptr, i64 64, x86_amx %tile)

It will definitely go out of bounds: the tile store writes 8 rows of 32 bytes at a stride of 64, so it touches bytes up to offset 7*64+32 = 480, while the <256 x i8> destination is only 256 bytes.

LuoYuanke added inline comments.Jun 15 2023, 2:32 AM
llvm/lib/Target/X86/X86LowerAMXType.cpp
930

Why is there a <256 x i8>? Shouldn't the tile type be <256 x i32>, which is 1024 bytes?