This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][GlobalISel] Fixup <32b heterogeneous regbanks of G_PHIs just before selection.
ClosedPublic

Authored by aemerson on Feb 24 2020, 2:58 PM.

Details

Summary

Since all types <32b on gpr end up being assigned gpr32 regclasses, we can end up with PHIs here which try to select between a gpr32 and an fpr16. Ideally RBS shouldn't be selecting heterogenous regbanks for operands if possible, but we still need to be able to deal with it here.

To fix this, if we have a gpr-bank operand < 32b in size and at least one other operand is on the fpr bank, then we add cross-bank copies to homogenize the operand banks. For simplicity the bank that we choose to settle on is whatever bank the def operand has. For example:

%endbb:
      %dst:gpr(s16) = G_PHI %in1:gpr(s16), %bb1, %in2:fpr(s16), %bb2
     =>
%bb2:
      ...
      %in2_copy:gpr(s16) = COPY %in2:fpr(s16)
      ...
%endbb:
      %dst:gpr(s16) = G_PHI %in1:gpr(s16), %bb1, %in2_copy:gpr(s16), %bb2

Diff Detail

Event Timeline

aemerson created this revision.Feb 24 2020, 2:58 PM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 24 2020, 2:58 PM

Hi Amara,

Why do you have to fix-up the PHIs before selection?

I would expect G_PHI to be selected to PHI (i.e., just a opcode change) and then the phi elimination pass will insert the cross copies for you.

Where is this going wrong?

Cheers,
-Quentin

I would expect G_PHI to be selected to PHI (i.e., just a opcode change) and then the phi elimination pass will insert the cross copies for you.

Where is this going wrong?

Thanks for taking a look. Yes the G_PHI gets selected to PHI. This is normally ok even in the case where we have operands with different regbanks. However on AArch64 this doesn't work when the type size is less than 32 bits. If this happens, the PHI operand with a GPR RegBank gets the gpr32 regclass assigned to it (we don't have anything smaller to represent types like s16 on GPR on AArch64). Then the FPR bank operand gets the appropriately sized FPR16 regclass assigned. These two registers have different sizes, and copyPhysReg doesn't know how to deal with copies between a gpr32 and an fpr16 or vice versa. I thought about maybe teaching copyPhysReg to deal with that, but the fact that we have to deal with subregisters (because of the differing sizes) makes me nervous and didn't seem like the best place to fix it.

I also tried to fix RegBankSelect to add fixups if we detect this case but this is an AArch64 specific issue, and it messed up a lot of existing PHI handling (which assumes we can just look at one operand to decide the RegBank).

Forgot to answer your first question:

Why do you have to fix-up the PHIs before selection?

The reason I'm doing this before selection is that by the time we get to a G_PHI, if it's part of a cycle, then an operand register will also be a selected instruction. If I do it before selection, then they should still have generic types and regbanks, and adding COPY instructions is simple because the existing COPY selection code can be harnessed to generate the right SUBREG_TO_REG/whatever.

Thanks for the clarification Amara.

Now a follow-up question:

If this happens, the PHI operand with a GPR RegBank gets the gpr32 regclass assigned to it (we don't have anything smaller to represent types like s16 on GPR on AArch64).

How do you represent s16 values into gpr32?

I would have expected that if you have to widen the size when assigning the actual register class, then legalization should have actually widen the value. Otherwise, that means we have to manually deal with "dangling" bits everywhere else.

Thanks for the clarification Amara.

Now a follow-up question:

If this happens, the PHI operand with a GPR RegBank gets the gpr32 regclass assigned to it (we don't have anything smaller to represent types like s16 on GPR on AArch64).

How do you represent s16 values into gpr32?

s16 operations, if legal, just get selected to gpr32 regclass instructions. E.g.

%ld:gpr(s16) = G_LOAD ...
=>
%ld:gpr32 = LDR_16 ...

For things like G_ADD we legalize to s32, but s16 itself as a type isn't illegal.

I would have expected that if you have to widen the size when assigning the actual register class, then legalization should have actually widen the value. Otherwise, that means we have to manually deal with "dangling" bits everywhere else.

qcolombet accepted this revision.Feb 26 2020, 11:55 AM

s16 operations, if legal, just get selected to gpr32 regclass instructions. E.g.

%ld:gpr(s16) = G_LOAD ...
=>
%ld:gpr32 = LDR_16 ...

That's interesting, because that means you have inreg extension where you would have regular extension in the IR.
E.g.,

s16 = G_LOAD
s32 = G_ZEXT s16

>

gpr32 = LDR_16
gpr32 = G_ZEXT gpr32

I would have expected that instead we would have combined loads with the related extension to from an extended load.

Anyway, I guess this ship has sailed.

LGTM.

This revision is now accepted and ready to land.Feb 26 2020, 11:55 AM

s16 operations, if legal, just get selected to gpr32 regclass instructions. E.g.

%ld:gpr(s16) = G_LOAD ...
=>
%ld:gpr32 = LDR_16 ...

That's interesting, because that means you have inreg extension where you would have regular extension in the IR.
E.g.,

s16 = G_LOAD
s32 = G_ZEXT s16

>

gpr32 = LDR_16
gpr32 = G_ZEXT gpr32

I would have expected that instead we would have combined loads with the related extension to from an extended load.

Yes, on AArch64 all sub-32bit loads give a free ZEXT. We catch this case in the selector if for some reason we have a redundant ZEXT.

Anyway, I guess this ship has sailed.

LGTM.

Something just stroke me, why don't you use the FPR bank for all 16-bit values then?

Put differently I feel this is a work around other issue, either legalization or reg bank select. Bottom line, I really don't like this patch, but I let you judge what is the best way to move forward.

This revision was automatically updated to reflect the committed changes.

I committed before I saw your message.

Something just stroke me, why don't you use the FPR bank for all 16-bit values then?

Off the top of my head: we would lose the distinction between a gpr and fpr load. For example, the following:

define i16 @bitwise(i16 *%a, i16 *%b) {
  %res = load i16, i16 *%a
  %x = xor i16 %res, 12345
  store i16 %x, i16 *%b
  ret i16 %x
}

would end up having RegBank assignments like this:

body:             |
  bb.1 (%ir-block.0):
    liveins: $x0, $x1
  
    %0:gpr(p0) = COPY $x0
    %1:gpr(p0) = COPY $x1
    %2:fpr(s16) = G_LOAD %0(p0) :: (load 2 from %ir.a)
    %copy:gpr(s16) = COPY %2(s16) ; Need a cross-bank copy here because of the ANYEXT+XOR.
    %6:gpr(s32) = G_ANYEXT %copy(s16)
    %7:gpr(s32) = G_CONSTANT i32 12345
    %8:gpr(s32) = G_XOR %6, %7
    %4:gpr(s16) = G_TRUNC %8(s32)
    %copy2:fpr(s16) = COPY %4(s16) ; Likewise here to get back to an fpr s16 value.
    G_STORE %copy2(s16), %1(p0) :: (store 2 into %ir.b)
    %5:gpr(s32) = COPY %8(s32)
    $w0 = COPY %5(s32)
    RET_ReallyLR implicit $w0

This results in cross-bank copies which are bad for performance. We can't eliminate s16 GPRs completely because things like trunc/extend need to be able to deal with them, as well as G_EXTRACT/INSERT. Maybe there's a way to deal with those issues too, I'm not sure.

Put differently I feel this is a work around other issue, either legalization or reg bank select. Bottom line, I really don't like this patch, but I let you judge what is the best way to move forward.

I think this ultimately stems from the fact that we lost fp/integer type information in the translator and trying to patch this back together from defs and uses is prone to failure. For performance reasons it there's probably no good reason to have G_PHIs with separate regbanks, but RBS is difficult to change. Not only that but in this case if you have a cyclic phi you can't even tell completely whether or not you have this problem because not every user has had a RegBank assigned yet.

I understand that cross copies are bad, but what I don't understand is why don't we form an extended load? That would apply nicely to your example.

Put differently, (I am being pedantic just for the sake of the argument), I would argue that:
s16 = G_LOAD
maps to
fpr16 = LDR

not
gpr32 = LDR16
since we're effectively sneaking in an extension in that load.

gpr32 = LDR16 would better match ANYEXT(G_LOAD).

All in all I understand your solution and why you're doing that. I am afraid we are actually building technical debt here, but I trust your judgement on what you think is best to move forward here.

To go back on RBS, we could "fix" it to issue only PHIs with the same bank, but effectively, we would just move the copies around and basically still have the weirdly sized cross copies [1] appearing as COPY instead of PHIs and in different blocks.

Like you noted with PHIs in loops, ideally we would like to have RBS to make its decision globally. That would indeed require some work and also, I believe that would only make the weird sized copies less likely to happen but not impossible. Therefore I really think this is a legalization issue.

[1] they are only weirdly sized after selection which is why I think we are actually not modeling this right :)

I understand that cross copies are bad, but what I don't understand is why don't we form an extended load? That would apply nicely to your example.

I see what you mean. Essentially we would take the code size/perf hit with the specific MIR example I used, but instead try harder to form the extending load ops and ensure those are assigned to gpr banks. There might be other issues with G_EXTRACT but we'll see.

Since this patch fixes a compiler crash I'm going to leave it in, but I'll investigate this approach as a follow up.

Put differently, (I am being pedantic just for the sake of the argument), I would argue that:
s16 = G_LOAD
maps to
fpr16 = LDR

not
gpr32 = LDR16
since we're effectively sneaking in an extension in that load.

gpr32 = LDR16 would better match ANYEXT(G_LOAD).

All in all I understand your solution and why you're doing that. I am afraid we are actually building technical debt here, but I trust your judgement on what you think is best to move forward here.

To go back on RBS, we could "fix" it to issue only PHIs with the same bank, but effectively, we would just move the copies around and basically still have the weirdly sized cross copies [1] appearing as COPY instead of PHIs and in different blocks.

Like you noted with PHIs in loops, ideally we would like to have RBS to make its decision globally. That would indeed require some work and also, I believe that would only make the weird sized copies less likely to happen but not impossible. Therefore I really think this is a legalization issue.

[1] they are only weirdly sized after selection which is why I think we are actually not modeling this right :)