The llvm.x86.cast.tile.to.vector intrinsic is lowered to llvm.x86.tilestored64.internal followed by a load <256 x i32>. The llvm.x86.cast.vector.to.tile intrinsic is lowered to a store <256 x i32> followed by llvm.x86.tileloadd64.internal. When llvm.x86.cast.tile.to.vector is used by a store <256 x i32>, or a load <256 x i32> is used by llvm.x86.cast.vector.to.tile, the pair can be combined into a single llvm.x86.tilestored64.internal or llvm.x86.tileloadd64.internal, respectively.
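A minimal IR sketch of the two combines (the names %p, %q, %row, %col, and %t are illustrative, not taken from the patch):

  ; Lowered form: a load feeds llvm.x86.cast.vector.to.tile.
  %vec  = load <256 x i32>, <256 x i32>* %p
  %tile = call x86_amx @llvm.x86.cast.vector.to.tile.v256i32(<256 x i32> %vec)

  ; Combined form: one tile load with a constant stride of 64 bytes.
  %ptr  = bitcast <256 x i32>* %p to i8*
  %tile = call x86_amx @llvm.x86.tileloadd64.internal(i16 %row, i16 %col, i8* %ptr, i64 64)

  ; Symmetrically, a cast whose result feeds a store:
  %vec2 = call <256 x i32> @llvm.x86.cast.tile.to.vector.v256i32(x86_amx %t)
  store <256 x i32> %vec2, <256 x i32>* %q

  ; becomes one tile store:
  %qp = bitcast <256 x i32>* %q to i8*
  call void @llvm.x86.tilestored64.internal(i16 %row, i16 %col, i8* %qp, i64 64, x86_amx %t)

Combining this way avoids materializing an extra stack slot to shuttle the <256 x i32> value between the vector and tile forms.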
Diff Detail
- Repository: rG LLVM Github Monorepo
Event Timeline
llvm/lib/Target/X86/X86LowerAMXType.cpp:949

I just realized that the tileload should be inserted before the load instruction instead of the cast. If there is a store between the load and the cast, this transform is not correct.
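A hypothetical IR sequence showing the hazard (illustrative, not from the patch):

  %vec  = load <256 x i32>, <256 x i32>* %p
  ; An intervening store clobbers the same memory.
  store <256 x i32> zeroinitializer, <256 x i32>* %p
  %tile = call x86_amx @llvm.x86.cast.vector.to.tile.v256i32(<256 x i32> %vec)
  ; A tileloadd64 inserted at the cast would read the zeroed memory,
  ; not the value %vec that the earlier load produced, so the tile load
  ; has to be inserted at the position of the original load.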
llvm/lib/Target/X86/X86LowerAMXType.cpp:930

Why is the stride 64 here instead of Col?
llvm/lib/Target/X86/X86LowerAMXType.cpp:930

Both 64 and Col should work as long as the load and the store keep the same stride value, but 64 is a constant, so it is preferred (see the sketch below).
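A minimal sketch of that round trip, assuming an illustrative spill slot (%buf, %row, %col, and %t are hypothetical):

  %buf = alloca <256 x i32>, align 64
  %ptr = bitcast <256 x i32>* %buf to i8*
  ; Spill: each of the %row rows is written 64 bytes apart.
  call void @llvm.x86.tilestored64.internal(i16 %row, i16 %col, i8* %ptr, i64 64, x86_amx %t)
  ; Reload with the same constant stride, so the same bytes are read back.
  %t2 = call x86_amx @llvm.x86.tileloadd64.internal(i16 %row, i16 %col, i8* %ptr, i64 64)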
llvm/lib/Target/X86/X86LowerAMXType.cpp:930

How about the following IR: … If you combine it into: … it will definitely go out of bounds.
llvm/lib/Target/X86/X86LowerAMXType.cpp:930

Why is there a <256 x i8>? Shouldn't the tile size be <256 x i32>, which is 1024 bytes?