This is an archive of the discontinued LLVM Phabricator instance.

[X86][AMX] Fix the shape dependency issue.
ClosedPublic

Authored by LuoYuanke on Nov 13 2022, 7:45 PM.

Details

Summary

The AMX shape must be defined before the AMX intrinsics that use
it. However, for the case below, the shape a.row is defined after
the tile load of b. If we transform the load of b into the
@llvm.x86.tileloadd64 intrinsic, the shape dependency is not
satisfied.

void test_tile_dpbsud(__tile1024i a, __tile1024i b, __tile1024i c) {
  __tile_dpbsud(&c, a, b);
}

This patch stores tile b to the stack and reloads it after the def
of b.row. This introduces a redundant store/load pair, but it is a
simple way to avoid generating invalid IR.
A better approach might be to hoist the def of b.row above the tile
load instruction, but recursively hoisting its operands seems more
complicated.
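
A minimal sketch of the idea (the value names, the shape operands
%b.shape.row/%b.shape.col, and the stride below are placeholders for this
illustration, not taken from the patch): spill the plain <256 x i32> vector
to a stack slot at the original load point, and materialize the tile with the
shape-carrying @llvm.x86.tileloadd64.internal form only once the shapes are
defined.

  ; Before the fix: rewriting %b.tile directly into a tile load would place
  ; the tile load above the shape definition it needs.
  %b.tile = load <256 x i32>, ptr %b.tile.ptr, align 64
  %a.row  = load i16, ptr %a, align 64

  ; After the fix: spill at the original load point, then reload as a tile
  ; once the shapes are available (each spilled row is 64 bytes, hence the
  ; stride of 64).
  %spill  = alloca <256 x i32>, align 64
  %b.tile = load <256 x i32>, ptr %b.tile.ptr, align 64
  store <256 x i32> %b.tile, ptr %spill, align 64
  %a.row  = load i16, ptr %a, align 64
  %b.amx  = call x86_amx @llvm.x86.tileloadd64.internal(i16 %b.shape.row, i16 %b.shape.col, ptr %spill, i64 64)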

Diff Detail

Event Timeline

LuoYuanke created this revision. · Nov 13 2022, 7:45 PM
Herald added a project: Restricted Project. · View Herald Transcript · Nov 13 2022, 7:45 PM
LuoYuanke requested review of this revision. · Nov 13 2022, 7:45 PM

Maybe it is time to revisit the earlier approach for implementing "define the shape before use",
e.g., whether it would be cleaner to have the front end make that guarantee.

Yes, it would be great if this could be prevented in the FE. I tried using a memory barrier, as in the code below, but it doesn't work. Ideas are welcome.

static __inline__ void __tile_dpbsud(__tile1024i *dst, __tile1024i src0,
                                     __tile1024i src1) {
  short m = src0.row;
  short n = src1.col;
  short k = src0.col;
  asm volatile ("" : : : "memory");
  dst->tile = _tile_dpbsud_internal(m, n, k, dst->tile,
                                    src0.tile, src1.tile);
}

static __inline__ void __tile_dpbsud(__tile1024i *dst, __tile1024i src0,
                                     __tile1024i src1) {
  short m = src0.row;
  short n = src1.col;
  short k = src0.col;
  func_use(m, n, k, ...);  // treat it as a scheduling boundary and emit nothing for this special variadic marker function?
  dst->tile = _tile_dpbsud_internal(m, n, k, dst->tile,
                                    src0.tile, src1.tile);
}

I did an experiment with your proposal. The result is the same as with the memory barrier: it can't prevent reordering of the shape load and the tile load.

How about merging the "load + cast" pair at the cast position rather than at the load?
For example, generate the tileload for lines 95 and 105 at line 105:

 89 *** IR Dump After Lower AMX intrinsics (lower-amx-intrinsics) ***
 90 define void @test_tile_dpbssd(ptr byval(%struct.__tile1024i_str) align 64 %a, ptr byval(%struct.__tile1024i_str) align 64 %b, ptr byval(%struct.__tile1024i_str) align 64 %c) {
 91 entry:
 92   %b.row.ptr = getelementptr inbounds i8, ptr %b, i64 2
 93   %b.row = load i16, ptr %b.row.ptr, align 2
 94   %b.tile.ptr = getelementptr inbounds i8, ptr %b, i64 64
 95   %b.tile = load <256 x i32>, ptr %b.tile.ptr, align 64
 96   %a.row = load i16, ptr %a, align 64
 97   %a.col.ptr = getelementptr inbounds i8, ptr %a, i64 2
 98   %a.col = load i16, ptr %a.col.ptr, align 2
 99   %a.tile.ptr = getelementptr inbounds i8, ptr %a, i64 64
100   %a.tile = load <256 x i32>, ptr %a.tile.ptr, align 64
101   %c.tile.ptr = getelementptr inbounds %struct.__tile1024i_str, ptr %c, i64 0, i32 3
102   %c.tile = load <256 x i32>, ptr %c.tile.ptr, align 64
103   %c.amx = tail call x86_amx @llvm.x86.cast.vector.to.tile.v256i32(<256 x i32> %c.tile)
104   %a.amx = tail call x86_amx @llvm.x86.cast.vector.to.tile.v256i32(<256 x i32> %a.tile)
105   %b.amx = tail call x86_amx @llvm.x86.cast.vector.to.tile.v256i32(<256 x i32> %b.tile)
106   %res = tail call x86_amx @llvm.x86.tdpbssd.internal(i16 %a.row, i16 %b.row, i16 %a.col, x86_amx %c.amx, x86_amx %a.amx, x86_amx %b.amx)
107   ret void
108 }
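
For illustration only (this is not IR produced by the current patch), the proposal as I read it amounts to sinking the vector load of b from line 95 down to just before its cast at line 105, e.g.:

  %b.tile.ptr = getelementptr inbounds i8, ptr %b, i64 64
  ; ... shape loads and the loads/casts of %a and %c unchanged ...
  %a.amx = tail call x86_amx @llvm.x86.cast.vector.to.tile.v256i32(<256 x i32> %a.tile)
  %b.tile = load <256 x i32>, ptr %b.tile.ptr, align 64   ; sunk from line 95
  %b.amx = tail call x86_amx @llvm.x86.cast.vector.to.tile.v256i32(<256 x i32> %b.tile)

That way, by the time the load + cast pair is folded into a tile load, every shape operand already dominates it.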

There may be instructions that modify memory between the load and the cast instruction, so we need to analyze whether it is safe to sink the load down to the cast.
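
For example (illustrative IR, not from this review), a may-aliasing store between the two instructions would block the sink unless analysis proves otherwise:

  %b.tile = load <256 x i32>, ptr %b.tile.ptr, align 64
  store <256 x i32> zeroinitializer, ptr %q, align 64   ; %q may alias %b.tile.ptr
  %b.amx = tail call x86_amx @llvm.x86.cast.vector.to.tile.v256i32(<256 x i32> %b.tile)
  ; Sinking %b.tile below the store would change the value the cast sees, so the
  ; transform needs to prove there are no intervening writes that may alias.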

Makes sense! LGTM

xiangzhangllvm accepted this revision. · Nov 15 2022, 1:22 AM
This revision is now accepted and ready to land. · Nov 15 2022, 1:22 AM
This revision was landed with ongoing or failed builds. · Nov 15 2022, 6:50 PM
This revision was automatically updated to reflect the committed changes.