This is an archive of the discontinued LLVM Phabricator instance.

[X86] Fix tile config register spill issue.
ClosedPublic

Authored by pengfei on Jan 21 2021, 7:23 AM.

Details

Summary

This is an optimized approach for D94155.

Previous code build the model that tile config register is the user of
each AMX instruction. There is a problem for the tile config register
spill. When across function, the ldtilecfg instruction may be inserted
on each AMX instruction which use tile config register. This cause all
tile data register clobber.

To fix this issue, we remove the model of tile config register. We
analyze the AMX instructions between one call to another. We will insert
ldtilecfg after the first call if we find any AMX instructions.

We also optimized the inserting of tilerelease by moving it just after
the last AMX instruction of each branch. We will insert it to its
successors if the BB contains the last AMX instruction is in a loop.

Diff Detail

Event Timeline

pengfei created this revision.Jan 21 2021, 7:23 AM
pengfei requested review of this revision.Jan 21 2021, 7:23 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 21 2021, 7:23 AM
pengfei updated this revision to Diff 318352.Jan 21 2021, 4:31 PM

Fix Lint warnings.

xiangzhangllvm added inline comments.Jan 21 2021, 9:53 PM
llvm/lib/Target/X86/X86PreTileConfig.cpp
321

The Algorithm complexity is O(N*N)

llvm/lib/Target/X86/X86PreTileConfig.cpp
321

The Algorithm complexity is O(N*N)

321

The Algorithm complexity is O(N*N)

I think we can do this in 1 DFS.

llvm/lib/Target/X86/X86PreTileConfig.cpp
281

If the call at the bottom of the BB, we may need insert tile_ld_config too.

pengfei added inline comments.Jan 21 2021, 11:57 PM
llvm/lib/Target/X86/X86PreTileConfig.cpp
281

Good catch! Thanks.

321

The Algorithm complexity is O(N*N)

I think we can do this in 1 DFS.

The Algorithm complexity is O(N) ideally. E.g. Each BB has at least one call. If not, the BB (and its successor) might be iterated more than once. I can improve it future.

LuoYuanke added inline comments.Jan 22 2021, 12:42 AM
llvm/lib/Target/X86/X86PreTileConfig.cpp
269

WorkList and LastAMXSet may be a class member to avoid parameter passing.

LuoYuanke added inline comments.Jan 22 2021, 5:01 PM
llvm/lib/Target/X86/X86PreTileConfig.cpp
279

This is not clear to me. In below example, do we only insert tilerelease in amx1? Does call2 have chance to be iterated? We may add more test cases for different scenario.

       call0
    /        \
amx1       amx2
  call1       call2
pengfei updated this revision to Diff 318743.Jan 23 2021, 4:10 AM
  1. Address review comments from Xiang and Yuanke.
  2. Optimize the algorithm to a single DFS.
  3. Fix bugs when counting the place for tilerelease.
pengfei edited the summary of this revision. (Show Details)Jan 23 2021, 4:11 AM
pengfei marked 2 inline comments as done.Jan 23 2021, 4:17 AM
pengfei added inline comments.
llvm/lib/Target/X86/X86PreTileConfig.cpp
279

The previous code doesn't have these problem. Anyway, I changed the algorithm. It should be more clear now. Could you have a new review?

pengfei updated this revision to Diff 318747.Jan 23 2021, 5:06 AM

Remove an unnecessary ++MII;

pengfei updated this revision to Diff 318749.Jan 23 2021, 5:28 AM

A minor refactor.

pengfei updated this revision to Diff 318754.Jan 23 2021, 7:51 AM

This patch reduces algorithm complexity of function findNeedInsertBB, while increases a bit code complexity compared with previous one.

pengfei updated this revision to Diff 318758.Jan 23 2021, 8:11 AM

A minor change.

LuoYuanke added inline comments.Jan 23 2021, 4:10 PM
llvm/lib/Target/X86/X86PreTileConfig.cpp
354

It seems a lot of change in the patch. If you drop tilerelease, will the patch be simpler. I would like solve tile config issue first and move forward.

pengfei added inline comments.Jan 23 2021, 6:48 PM
llvm/lib/Target/X86/X86PreTileConfig.cpp
354

Yes, it will be. But the former BFS + DFS algorithm is still complex than the single DFS one, which means we still need major changes.
I would like to keep the tilerelease handling here. I didn't find bug of it for now and the calculating math is straightforward. I can show you the 4 status in below:

    ______________________________
S3 |begin ... AMX ... call ... end|
    ------------------------------
    ______________________________
S2 |begin ... call ... AMX ... end|
    ------------------------------
           _____________
S1        |begin ... end|
           -------------
          _/_       _\_
   max(S(|BB0|), S(|BB1|), ...) >= 1
          ---       ---

S0 is similar with S1 where all its successors' status are all 0. It doesn't matter if there's loop in the graph. Because when calculating the status, we assign 0 to current BB at first (if there's no AMX in it). Then we will iterate all its successors. It turns to 1 if one of other successors is not 0, and keeps 0 if all other successors are 0.
We insert tilerelease after the last AMX of current BB only if all its successors' status are 0. Otherwise, we will iterate all its first level successors and insert tilerelease at the begining of BB whoes status is 0. We also need to consider the BB whoes status is 1. We will recursively find their successors and insert tilerelease for BBs whoes status is 0.
All BB will be calculated for tilerelease inserting case once at most, since we record them in BBNeedInsertVisited. There's no big algorithm complexity increasement here.

LuoYuanke added inline comments.Jan 23 2021, 8:55 PM
llvm/lib/Target/X86/X86PreTileConfig.cpp
334

May add more test case for tilerelease. We may insert multi tilerelease in different block, better to have test case to cover that.

351

The 4 state may be a enum to be more readable.

pengfei updated this revision to Diff 318822.Jan 24 2021, 3:31 AM

Adress Yuanke's comments.

pengfei marked 2 inline comments as done.Jan 24 2021, 3:32 AM
LuoYuanke added inline comments.Jan 24 2021, 4:02 AM
llvm/lib/Target/X86/X86PreTileConfig.cpp
364

This confused me. This looks check call after amx. Should we instead check if there is amx instruction after call to determine if ldtilecfg is needed?

pengfei updated this revision to Diff 318833.Jan 24 2021, 5:00 AM
pengfei marked an inline comment as done.

Address Yuanke's comment. Moving the insertion to check AMX block.

pengfei added inline comments.Jan 24 2021, 5:03 AM
llvm/lib/Target/X86/X86PreTileConfig.cpp
364

I moved it to AMX checking block now.
The previous code try to reduce multi insertion by puting it in the call checking block.

pengfei updated this revision to Diff 318842.Jan 24 2021, 7:09 AM

Fix a missing case in function updateBBStatus.

pengfei updated this revision to Diff 318890.Jan 24 2021, 6:14 PM

Small refactor: Change recursive DFS to iterative DFS when insert tilerelease.

LuoYuanke added inline comments.Jan 24 2021, 8:21 PM
llvm/lib/Target/X86/X86PreTileConfig.cpp
264

I'm not sure there is only one exit.

360

Not sure about this condition. If there is no call in its direct successor, but there is amx before call in its successor's successor.

      call
    /      \
BB1      BB2
  |
 amx
 call
367

I didn't figure out this function yet. @xiangzhangllvm, would you help to review this function?

xiangzhangllvm added inline comments.Jan 25 2021, 2:21 AM
llvm/lib/Target/X86/X86PreTileConfig.cpp
260

We just need to collect insert point after call which appended by AMX instruction,
I suggest a clear O(N) logic in 1 DFS:

ReloadTileConfig (MBB ) {  // stark from tilecfg MBB
     For each MI in MBB {
        CallNum = 0;
        if MI is Call 
            push Call into CallStack
            ++CallNum;
        if MI is AMX
            Mark the Call in Top of CallStack is NeedInsert;
      }
      For MBBS in MBB->Succs
          ReloadTileConfig (MBB )
      While CallNum--
          if CallStack .top is marked NeedInsert
             Save CallStack .top into NeedInsertPosSet //NeedInsertPosSet will collect all the insert points
          CallStack.pop 
}
pengfei marked an inline comment as done.Jan 25 2021, 8:52 AM
pengfei updated this revision to Diff 319031.Jan 25 2021, 8:53 AM

Fix a bug when calculating indirect successors.

pengfei updated this revision to Diff 319470.Jan 26 2021, 11:16 PM

Reduce BB status to 3 cases, simplify code based on it.

pengfei updated this revision to Diff 319528.Jan 27 2021, 4:24 AM

A minor refactor.

pengfei updated this revision to Diff 319556.Jan 27 2021, 6:42 AM

Change enum names to make them more accurate.

xiangzhangllvm added inline comments.Jan 27 2021, 5:51 PM
llvm/lib/Target/X86/X86PreTileConfig.cpp
283

I am not sure the order of iterate BBVisitedInfo is right.
In map it may use "<" to sort its elements.

292

What does HasBeforeCallAMX really mean, why set it once BB has AMX ?

298

Not sure directly continue is right, if the status required by the visited predecessors.

pengfei added inline comments.Jan 27 2021, 7:14 PM
llvm/lib/Target/X86/X86PreTileConfig.cpp
283

The order doesn't matter. Just make sure all BBs are iterated.

292

It will be updated soon in line 293. We can assume it alway has AMX before call first. Then together with no AMX case, we check if there's call before AMX. If there is, we update it to HasAfterCallAMX.

298

We don't need to care about the BB which is not found when we do searching from top to bottom.
Becaused these BBs are never to be a successor of any AMX instruction.
We might meet them sometimes since we are searching from bottom to top. Ignoring them is safe.

pengfei updated this revision to Diff 319777.Jan 28 2021, 12:10 AM

Remove the code of inserting tilerelease for easy review.

xiangzhangllvm added inline comments.Jan 28 2021, 3:41 AM
llvm/lib/Target/X86/X86PreTileConfig.cpp
229

LGTM for the logic.

235

Can use a better name for HasCallBeforeAMX? it is very easy to mix with HasAfterCallAMX when I read here

283

The logic here is complex, better to add clear comment to show how we update the Status for BBs.

pengfei updated this revision to Diff 319830.Jan 28 2021, 4:55 AM

Address Xiang's comments.

LuoYuanke added inline comments.Jan 28 2021, 4:56 AM
llvm/lib/Target/X86/X86PreTileConfig.cpp
230

Better to have a diagram comments to illustrate the state.

235

Agree. It is easy to mix with HasBeforeCallAMX.

271

!WorkList.empty()?

283

WorkList should be clear before this loop?

286

(!WorkList.empty()) seems more readable.

297

Not quite understand this. Isn't all BBs added at line 275?

302

Don't understand this. Can you add a diagram to illustrate it?

pengfei marked an inline comment as done.Jan 28 2021, 4:56 AM
pengfei added inline comments.
llvm/lib/Target/X86/X86PreTileConfig.cpp
229

Thanks for the review.

235

Should IsCallBeforeAMX better?

LuoYuanke added inline comments.Jan 28 2021, 5:00 AM
llvm/lib/Target/X86/X86PreTileConfig.cpp
252

What about multi call and multi AMX in one BB? For below example, should we insert ldtilecfg after call1?

call1
AMX1
call2
AMX2

LuoYuanke added inline comments.Jan 28 2021, 5:03 AM
llvm/lib/Target/X86/X86PreTileConfig.cpp
235

Maybe HasCallBeforeAMXInBB.

pengfei added inline comments.Jan 28 2021, 5:26 AM
llvm/lib/Target/X86/X86PreTileConfig.cpp
235

HasCallBeforeAMXInBB is still easy mix up with HasAfterCallAMX, besides, we need to mark it true even there's no AMX in the BB.

252

Yes, we insert for call1 and call2 in line 247.
We only add the BB to BBUnsolved if there's no AMX after the last call.

297

No, we only added successors since ldtilecfg BB. BB0, BB1 and BB2 in below graph won't be added.

  BB0   ldtilecfg
   |      / \
  BB1   BB4 BB5
  / \   /   /
BB2  BB3  BB6
   \  |  /
     BB7
LuoYuanke added inline comments.Jan 28 2021, 5:42 AM
llvm/lib/Target/X86/X86PreTileConfig.cpp
297

There is no AMX instruction in BB0, BB1, BB2, so we don't care about BB0, BB1, BB2?

pengfei added inline comments.Jan 28 2021, 5:59 AM
llvm/lib/Target/X86/X86PreTileConfig.cpp
297

Yes. BB ldtilecfg dominates all BB that have AMX. So, there's no AMX in BB0, BB1 and BB2. In fact, BB3 and BB7 don't have either.
We calculate the status here which are used by the BB in BBUnsolved, which will never use the status of BB0~2. Nevertheless, we know their stutus are always NoAMXAtAll.

pengfei updated this revision to Diff 319848.Jan 28 2021, 6:25 AM

Address Yuanke's comments.

pengfei marked 3 inline comments as done.Jan 28 2021, 6:28 AM
pengfei added inline comments.
llvm/lib/Target/X86/X86PreTileConfig.cpp
283

Not necessary. But adding it seems more robust.

302

Add some comments on line 290.

LuoYuanke added inline comments.Jan 29 2021, 3:50 AM
llvm/lib/Target/X86/X86PreTileConfig.cpp
298

State HasAfterCallAMX seems useless. I didn't find we check this state anywhere.

306

Why if BBVisitedInfo[*I].HasAMX is true, we don't update it? I think when BBVisitedInfo[*I].MaxSucc changes, we need update its predecessor.

pengfei added inline comments.Jan 29 2021, 7:28 AM
llvm/lib/Target/X86/X86PreTileConfig.cpp
298

Yes, it was used for inserting tilerelease. Now we just need 2 status, I simplified code base it.

306

It is just an optimization.
If *I has AMX, the status it contributes to its predecessors won't be changed with its MaxSucc, e.g.
When we have a BB which has AMX and its MaxSucc is S0, it updates its predecessors with S2 in the first time. So next time, when its MaxSucc is changed to S1 by its successor, it still updates its predecessors with S2. This is not necessary, so it is skipped.
Anyway, since we are using 2 status now, this code is also been simplified.

pengfei updated this revision to Diff 320131.Jan 29 2021, 7:29 AM

Reduce one unused status. Simplify code based on it.

LuoYuanke accepted this revision.Jan 29 2021, 4:42 PM

LGTM. Thank you!

This revision is now accepted and ready to land.Jan 29 2021, 4:42 PM
xiangzhangllvm added inline comments.Jan 29 2021, 8:11 PM
llvm/lib/Target/X86/X86PreTileConfig.cpp
229

Not try to replace this patch, I update the O(N) algorithm in https://reviews.llvm.org/D95508 just for interest.
In Non-loop case, it cost O(N), in loop case, it cost O(2N)

This revision was landed with ongoing or failed builds.Jan 29 2021, 8:54 PM
This revision was automatically updated to reflect the committed changes.
pengfei added inline comments.Jan 29 2021, 9:12 PM
llvm/lib/Target/X86/X86PreTileConfig.cpp
229

So it is related on the number of loops? But in the worst case, a graph may have O(N!) loops? The worst complexity is terrible.

xiangzhangllvm added inline comments.Jan 30 2021, 1:02 AM
llvm/lib/Target/X86/X86PreTileConfig.cpp
229

So it is not related on the number of loops, in loop case the max cost is O(2N), (we see O(2N) == O(N) in program)