This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Use shift for b64 mov
AbandonedPublic

Authored by sebastian-ne on Nov 12 2021, 8:39 AM.

Details

Summary

There is no v_mov_b64, but a v_lshlrev_b64 can accomplish the same by
shifting a 64-bit register by 0.
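
The equivalence claimed in the summary can be sketched in plain C++. This is only a host-side model, not backend code: the function names `lshlrev_b64` and `mov_b64_as_two_b32` are hypothetical, and the model assumes the instruction takes the shift amount as its first operand and uses only its low 6 bits.

```cpp
#include <cstdint>

// Hypothetical model of a 64-bit left shift with reversed operands: the shift
// amount comes first and (by assumption) only its low 6 bits are used.
constexpr std::uint64_t lshlrev_b64(std::uint32_t shiftAmount,
                                    std::uint64_t value) {
  return value << (shiftAmount & 63u);
}

// Hypothetical model of the two-instruction alternative: copy the low and
// high 32-bit halves separately and recombine them.
constexpr std::uint64_t mov_b64_as_two_b32(std::uint64_t value) {
  const std::uint32_t lo = static_cast<std::uint32_t>(value);
  const std::uint32_t hi = static_cast<std::uint32_t>(value >> 32);
  return (static_cast<std::uint64_t>(hi) << 32) | lo;
}

// Shifting by 0 leaves the value unchanged, so it behaves like a 64-bit move.
static_assert(lshlrev_b64(0, 0x0123456789ABCDEFull) == 0x0123456789ABCDEFull,
              "shift by 0 must act as a move");
static_assert(lshlrev_b64(0, 0x0123456789ABCDEFull) ==
                  mov_b64_as_two_b32(0x0123456789ABCDEFull),
              "both forms must produce the same 64-bit value");
```

Both forms yield the same result; the discussion that follows is about which one is cheaper to issue.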

Event Timeline

sebastian-ne created this revision. Nov 12 2021, 8:39 AM
sebastian-ne requested review of this revision. Nov 12 2021, 8:39 AM
Herald added a project: Restricted Project. Nov 12 2021, 8:39 AM
foad added a reviewer: mjbedy. Nov 12 2021, 8:49 AM

Interesting. I see you're doing this when expanding V_MOV_B64_PSEUDO, but I don't really understand when we use V_MOV_B64_PSEUDO in the first place. copyPhysReg() does not generate it; instead, it copies the logic from here to emit V_PK_MOV_B32. Does that mean you need to add your V_LSHLREV_B64_e64 code to copyPhysReg too?

64-bit shifts were quarter-rate instructions last I checked, so this is slower.

foad added a comment. Nov 12 2021, 8:57 AM

The Write64Bit definitions in SISchedule.td suggest they are half rate on most subtargets and full rate on gfx90a.

I think that's probably wrong. Comments in performShlCombine, for example, say it's quarter rate.

It seems to be quarter rate (or something slow) on gfx9, full rate on gfx90a and half rate on gfx10?
Then it would be worth using on gfx90a and gfx10+.

You do not need this on gfx90a because there is pk_mov. On gfx10 it arguably performs the same as two moves.

V_MOV_B64_PSEUDO was created to deal with 64-bit immediates and to fold them. It is not needed that late.

For GFX10, I don't think this is worth doing unless V_LSHLREV_B64 is full rate.
2x V_MOV_B32 in VOP1 takes the same space as V_LSHLREV_B64 in VOP3.

I think this is right. The two moves can also be scheduled apart, leaving room for something else to be scheduled in between. A 64-bit shift is rarely beneficial in general if you can get away without it.
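
The size and rate argument in this thread can be written out as a small cost model. This is a rough sketch under stated assumptions, not measured data: the `Sequence` type and helper names are hypothetical, a VOP1 encoding is taken as 4 bytes and a VOP3 encoding as 8 bytes, V_MOV_B32 is assumed to be full rate, and "full", "half", and "quarter" rate are simplified to 1, 2, and 4 issue cycles per instruction.

```cpp
#include <cstdint>

// Hypothetical description of an instruction sequence for comparison.
struct Sequence {
  unsigned instructions;         // number of instructions issued
  unsigned bytesPerInstruction;  // encoding size of each instruction
  unsigned cyclesPerInstruction; // 1 = full rate, 2 = half, 4 = quarter
};

constexpr unsigned codeSize(Sequence s) {
  return s.instructions * s.bytesPerInstruction;
}

constexpr unsigned issueCycles(Sequence s) {
  return s.instructions * s.cyclesPerInstruction;
}

// Two V_MOV_B32 in VOP1: assumed 4 bytes each and full rate.
constexpr Sequence twoMovB32() { return {2, 4, 1}; }

// One V_LSHLREV_B64 in VOP3: assumed 8 bytes; the rate is a parameter
// because the thread does not settle it for every subtarget.
constexpr Sequence oneShiftB64(unsigned cyclesPerInstruction) {
  return {1, 8, cyclesPerInstruction};
}

// Same code size either way, as noted in the review.
static_assert(codeSize(twoMovB32()) == codeSize(oneShiftB64(2)),
              "both sequences occupy 8 bytes in this model");
```

In this model the shift only wins on issue cycles when it is full rate (1 cycle against 2), ties at half rate, and loses at quarter rate. The model does not capture the scheduling freedom of two independent moves, which is a further point in their favor.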

sebastian-ne abandoned this revision. Nov 15 2021, 2:21 AM

Those are good arguments, thanks for your thoughts.