This is an archive of the discontinued LLVM Phabricator instance.

[PowerPC] Exploit the rldicl + rldicl when and with mask
ClosedPublic

Authored by steven.zhang on Dec 23 2019, 1:12 AM.

Details

Summary

If we are and the constant like 0xFFFFFFC00000, for now, we are using several instructions to generate this 48bit constant and final an "and". However, we could exploit it with two rotate instructions.

       MB          ME               MB+63-ME 
+----------------------+     +----------------------+
|0000001111111111111000| ->  |0000000001111111111111|
+----------------------+     +----------------------+
 0                    64      0                    64

Rotate left ME + 1 bit first, and then, mask it with (MB + 63 - ME, 63), finally, rotate back. Notice that, we need to round it with 64 bit for the wrapping case.

Diff Detail

Event Timeline

steven.zhang created this revision.Dec 23 2019, 1:12 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 23 2019, 1:12 AM

Rebase the patch. Testing with spec and get about ~4k instructions reduce.

Rebase the patch.

Sorry for the delayed comments.

llvm/lib/Target/PowerPC/PPCISelDAGToDAG.cpp
4368

andis. in .td handles 0xABCD0000 u32imm, will this makes this kind imm worse?

4380

s/64/63?

4384

ME could be 63, putting 63 + 1 here seems unreasonable.

4391

I personally think the logic for the special case is a little hard to follow.
I guess you want to convert the following imm: 0x00ffffffffff00ff,

can we first treat it as imm' 0xffffffffffff00ff? and we can call normal case like above, then we get

Val = SDValue(CurDAG->getMachineNode(PPC::RLDICL, Loc, MVT::i64, Val,
                                         getI64Imm(ME + 1, Loc),
                                         getI64Imm((MB + 63 - ME) & 63, Loc)),

SDValue Ops[] = {Val, getI64Imm(63 - ME, Loc), getI64Imm(0, Loc)};
CurDAG->SelectNodeTo(N, PPC::RLDICL, MVT::i64, Ops);

Notice that the second rldicl for above normal case has no clear operation, so its mb is 0, can we add the clear for the highest 8 bits here?
Make a call for above normal case and then clear the highest bits in the second rldicl?

4397

MB can be calculated after checking ME is valid?

4398

check last bit is 1? Is it too complicated to call countTrailingZeros?

4402

ME is always 63? Why not use 63 directly?

4420

s/32/31, s/64/63

steven.zhang marked 3 inline comments as done.Mar 13 2020, 1:26 AM
steven.zhang added inline comments.
llvm/lib/Target/PowerPC/PPCISelDAGToDAG.cpp
4391

No. The pattern is: 0x00fff00ff000. The idea is to calculate the position of [MB1, ME1] for fff, and [MB2, ME2] for ff. Then, use two rotate clear instructions to do the calculation,

4398

The last bit doesn't necessary to be 1. We need to get the number of trailing 0's.

4402

There won't be performance difference between the two. Use ME to make the logic clear.

steven.zhang planned changes to this revision.Mar 13 2020, 2:20 AM

oops, we indeed could simplify the logic here.

Address comments.

steven.zhang marked 3 inline comments as done.Mar 17 2020, 12:28 AM
steven.zhang added inline comments.
llvm/lib/Target/PowerPC/PPCISelDAGToDAG.cpp
4368

No. It will be caught by single rlwinm if it is the pattern that this patch matches.

4391

In fact, both works. The pattern is 0x00fff00fff. Update the patch to make it more clear.

4398

Good point. In fact, we don't need the check here as the isRunOfOnes64 will check it for us.

shchenz added inline comments.Mar 19 2020, 8:43 PM
llvm/lib/Target/PowerPC/PPCISelDAGToDAG.cpp
4395

For special case, form a new Imm, like:

APInt Res(64, Imm64);                                                                                                                                
ClearBits = Res.countLeadingZeros();                                                                                                                 
if (ClearBits != 0) {                                                                                                                                
  // change pattern |0001111100000011111111|                                                                                                         
  //                       to |1111111100000011111111|                                                                                                         
  APInt Mask = APInt::getBitsSet(64, 64 - ClearBits, 64);                                                                                            
  Res = Res | Mask;                                                                                                                                  
  Imm64 = Res.getZExtValue();                                                                                                                        
}

And pass new Imm64 and ClearBits to normal case, I think logic here maybe a little simple?

Address comments.

steven.zhang marked an inline comment as done.Mar 22 2020, 8:26 PM
steven.zhang added inline comments.
llvm/lib/Target/PowerPC/PPCISelDAGToDAG.cpp
4395

Good suggestion.

shchenz accepted this revision.Mar 23 2020, 12:38 AM

LGTM. Please hold on some days in case other reviewers have comments.

This revision is now accepted and ready to land.Mar 23 2020, 12:38 AM

I think this would be immensely more readable if you use more descriptive names for variables. It is very hard to get this bit manipulation right in ones head so I really think you should try your best to make this as simple to follow as possible:

  1. You use one mask in on line 4463 and then a different one on 4476. Please don't do this. Stick with a consistent example.
  2. Rename RotateRightClearLeft to something like RightJustifyRangeAndClear as it appears that is what the function is doing.
  3. Get rid of all the expressions involving ME/MB - especially things like <imm> +/- MB/ME as they are very difficult to reason about. For readability, favour defining temporary values just so they would have a name. For example: MB+63-ME is kind of meaningless to a reader. But if you do something like unsigned FirstBitSetWhenRightJustified = MB + 63 - ME; that is now much easier to follow. I realize that we are creating single-use temporaries this way, but I really think it is worth it for readability.

The algorithm appears to be something along the lines of:

if (!MaskIsTwoContiguousRunsOfOnes)
  return
// Add Missing Bits On Left To The Mask
// Right Justify Mask And Clear Bits Formerly In The Middle
// Rotate Back And Clear Bits Previously Added On Left

And I think the comments, function names and variable names should make it clear that this is what is happening.

llvm/lib/Target/PowerPC/PPCISelDAGToDAG.cpp
4373

I think it is confusing for a function called RotateRightClearLeft to emit a "Rotate Left Clear Left"

I think this would be immensely more readable if you use more descriptive names for variables. It is very hard to get this bit manipulation right in ones head so I really think you should try your best to make this as simple to follow as possible:

  1. You use one mask in on line 4463 and then a different one on 4476. Please don't do this. Stick with a consistent example.

Good catch.

  1. Rename RotateRightClearLeft to something like RightJustifyRangeAndClear as it appears that is what the function is doing.

In ISA, the RLDICL is named as: "Rotate Left Doubleword Immediate then Clear Left". I am not sure if the right justify range make it more clear. Regarding to left or right, it is just a wrap rotate. The lambda here is trying to hide the detail that implement the right rotate with left rotate. Personally, I prefer the RotateRightClearLeft one, but also ok to the RightJustifyRangeAndClear if you insist.

  1. Get rid of all the expressions involving ME/MB - especially things like <imm> +/- MB/ME as they are very difficult to reason about. For readability, favour defining temporary values just so they would have a name. For example: MB+63-ME is kind of meaningless to a reader. But if you do something like unsigned FirstBitSetWhenRightJustified = MB + 63 - ME; that is now much easier to follow. I realize that we are creating single-use temporaries this way, but I really think it is worth it for readability.

It is always a balance :) Agree that use the temp variable here as the rotate logic is indeed hard to follow.

The algorithm appears to be something along the lines of:

if (!MaskIsTwoContiguousRunsOfOnes)
  return
// Add Missing Bits On Left To The Mask
// Right Justify Mask And Clear Bits Formerly In The Middle
// Rotate Back And Clear Bits Previously Added On Left

And I think the comments, function names and variable names should make it clear that this is what is happening.

That is nice. Thank you.

Address comments.

  1. Rename RotateRightClearLeft to something like RightJustifyRangeAndClear as it appears that is what the function is doing.

In ISA, the RLDICL is named as: "Rotate Left Doubleword Immediate then Clear Left". I am not sure if the right justify range make it more clear. Regarding to left or right, it is just a wrap rotate. The lambda here is trying to hide the detail that implement the right rotate with left rotate. Personally, I prefer the RotateRightClearLeft one, but also ok to the RightJustifyRangeAndClear if you insist.

I think we can retire this discussion quite easily by just removing the lambda. There isn't a whole lot of point to a lambda that is used only once. If it helped with readability I would be in favour of it, but I think it actually makes the code less readable.

In any case, I described the concise algorithm in pseudo-code not so much because I think it is a useful comment, but because I think that is how the function should flow. For example, isn't something like this much more readable:

bool PPCDAGToDAGISel::tryAsPairOfRLDICL(SDNode *N) {
  assert(N->getOpcode() == ISD::AND && "ISD::AND SDNode expected");
  uint64_t Imm64;

  // Do nothing if it is 16-bit imm as the pattern in the .td file handles
  // it well with "andi.".
  if (!isInt64Immediate(N->getOperand(1).getNode(), Imm64) || isUInt<16>(Imm64))
    return false;

  SDLoc Loc(N);
  SDValue Val = N->getOperand(0);

  // Optimized with two rldicl's as follows:
  // Add missing bits on left to the mask and check that the mask is a
  // wrapped run of ones, i.e.
  // Change pattern |0001111100000011111111|
  //             to |1111111100000011111111|.
  unsigned NumOfLeadingZeros = countLeadingZeros(Imm64);
  if (NumOfLeadingZeros != 0)
    Imm64 |= maskLeadingOnes<uint64_t>(NumOfLeadingZeros);
  unsigned MB, ME;
  if (!isRunOfOnes64(Imm64, MB, ME))
    return false;

  //         ME     MB
  // +----------------------+     +----------------------+
  // |1111111100000011111111| ->  |0000001111111111111111|
  // +----------------------+     +----------------------+
  //  0                    63      0                    63
  // There are ME + 1 ones on the left and (MB - ME + 63) & 63 zeros in between.
  unsigned OnesOnLeft = ME + 1;
  unsigned ZerosInBetween = (MB - ME + 63) & 63;

  // Rotate left by OnesOnLeft (so leading ones are now trailing ones) and clear
  // on the left the bits that are already zeros in the mask.
  Val = SDValue(CurDAG->getMachineNode(PPC::RLDICL, Loc, MVT::i64, Val,
                                       getI64Imm(OnesOnLeft, Loc),
                                       getI64Imm(ZerosInBetween, Loc)),
                0);

  //                                      ME     MB
  // +----------------------+     +----------------------+
  // |0000001111111111111111| ->  |0001111100000011111111|
  // +----------------------+     +----------------------+
  //  0                    63      0                    63
  // Rotate back by 64 - OnesOnLeft to undo previous rotate. Then clear on the
  // left the number of ones we previously added.
  SDValue Ops[] = {Val, getI64Imm(64 - OnesOnLeft, Loc),
                   getI64Imm(NumOfLeadingZeros, Loc)};
  CurDAG->SelectNodeTo(N, PPC::RLDICL, MVT::i64, Ops);
  return true;
}

Note that the above is just a sample of how I think the code should be structured for readability. While I am fairly sure it is semantically equivalent to what you have, I did not carefully verify this.

Make sense. I will update the patch. Thank you for the comments.

Address Nemanjai's comments.

I will upstream this patch if no more comments in several days. Thank you for the nice refactor. (@nemanjai )

nemanjai accepted this revision.Apr 16 2020, 3:23 AM

LGTM. Thanks for your patience. As far as I'm concerned, feel free to commit when you're ready.

This revision was automatically updated to reflect the committed changes.