Page MenuHomePhabricator

Vector element extraction without stack operations on Power 8
ClosedPublic

Authored by nemanjai on Aug 14 2015, 6:03 AM.

Details

Summary

This patch builds onto the patch that provided scalar to vector conversions without stack operations (D11471).
Included in this patch:

  • Vector element extraction for all vector types with constant element number (both LE and BE)
  • Vector element extraction for v16i8 and v8i16 with variable element number (both LE and BE)
  • Removal of some unnecessary COPY_TO_REGCLASS operations that ended up unnecessarily moving things around between registers

Not included in this patch (will be in upcoming patch):

  • Vector element extraction for v4i32, v4f32, v2i64 and v2f64 with variable element number
  • Vector element insertion for variable/constant element number

Testing is provided for all extractions. The extractions that are not implemented yet are just placeholders.

Diff Detail

Repository
rL LLVM

Event Timeline

nemanjai updated this revision to Diff 32150.Aug 14 2015, 6:03 AM
nemanjai retitled this revision from to Vector element extraction without stack operations on Power 8.
nemanjai updated this object.
nemanjai added reviewers: wschmidt, hfinkel, kbarton, seurer.
nemanjai set the repository for this revision to rL LLVM.
nemanjai added a subscriber: llvm-commits.
nemanjai added inline comments.Aug 14 2015, 6:26 AM
test/CodeGen/PowerPC/p8-scalar_vector_conversions.ll
663

I actually think that the regular expressions for VRA and VRB here are not needed as I think the source vector must be in VR 2. Similarly for all other instances of the vperm.

wschmidt added inline comments.Aug 18 2015, 12:05 PM
lib/Target/PowerPC/PPCInstrVSX.td
1309

This code concerns me, starting with LE_BYTE_2. I see from your test cases that it happens to work, but it looks fragile to me.

If you have bytes 7 6 5 4 3 2 1 0 and apply RLDICL 48, 16, you will get
X X X X X X 3 2 where the Xs have been cleared to zero. The correct answer is X X X X X X X 2.

For RLDICL 40, 24, you will get X X X X X 5 4 3, which is incorrect for both byte and halfword extraction.

The pattern you need is (RLDICL LE_DWORD_0 48, 8), followed by 40,8, etc. Then you need separate patterns for LE_HWORD that uses 16 instead of 8.

Somehow in your tests you are ending up with extsb instructions that make this work, which I really don't understand based on the code you're specifying here (which just creates an i32, not an i8). I am concerned that those are an artifact that could disappear. Can you explain how those come to be generated?

1334

This looks wrong. If $Idx = 19, (ANDC8 (LI8 8), $Idx) will AND 0...001000 with 1...101100, giving 0...001000 = 8. Basically this expression will always produce either 0 or 8, right? That is surely not what you want. This same issue seems repeated several places below, so I've stopped trying to figure these expressions out for now.

Have you done any execution testing to check that the code you're generating works? I'm skeptical that it can.

wschmidt added inline comments.Aug 18 2015, 1:28 PM
lib/Target/PowerPC/PPCInstrVSX.td
1309

Ah, I see why. Your tests are returning an i8, which forces the conversion from i32 to i8 that generates the extsb. Without that, the incorrect code generation would be exposed.

nemanjai added inline comments.Aug 18 2015, 4:11 PM
lib/Target/PowerPC/PPCInstrVSX.td
1309

I think this isn't clearly specified in the language reference, but as far as I can tell, the remainder of the wider register when a value is extracted from a vector is undefined and hence need not be cleared. Namely, when an i8 or i16 is extracted from a vector, it must be extended (sign or zero) since we have no such small registers. Also, I don't think this is necessarily restricted to vector extraction. However we happen to get an i8 or i16 into a register, the SDAG will have a sign_extend or a zero_extend.
Even code like this:

signed char f(signed char c) {
  return c;
}

Will end up with the following SDAG nodes:

0x1003ec39830: i8,ch = load 0x1003ec39708, 0x1003ec39390, 0x1003ec395e0<LD1[%c.addr]> [ORD=4]
0x1003ec3a8c0: i32 = sign_extend 0x1003ec39830 [ORD=5]

Similarly, with code like this:

signed char f(vector signed char c) {
  return c[3];
}

We end up with a very similar pair of SDAG nodes:

0x10009e69b58: i8 = extract_vector_elt 0x10009e697e0, 0x10009e69a30 [ORD=5]
0x10009e69c80: i32 = sign_extend 0x10009e69b58 [ORD=6]

And the sign_extend then becomes an extsb. Of course, this does not apply with fast-isel, but then again, neither does any of this patch since it is all based on the SDAG.
I used signed char here just as an illustration since you mentioned the extsb instruction, but the equivalent zero_extend applies with the unsigned variants (in both cases).

So to summarize, I can add the clear instruction to each case, but I'm quite certain that it will be a reduntant instruction. It will also make the table gen logic more complex because I'd need a clear for the unsigned cases and a sign-extend for the signed cases.

1334

I may be missing something here, but always getting a 0 or 8 is precisely what I was after. Basically, if the element is on the left half of the VSR, I don't need to shift it. If it is on the right side of the VSR, I do need to shift it and I need to shift it by EXACTLY 8 bytes. This applies on both LE and BE except that the element numbers that are on the left vs. right half of the VSR are exact opposites.
For LE, the bytes 0-7 need to be shifted before the MFVSR (so the expression you mentioned returns 8).

Subsequent to this shift, what is then on the left side of the VSR gets moved out to a GPR.
Now that I have the right half of the VSR in the GPR, I need to shift to the right by the correct number of bytes (0-7). So element zero, needs not shift at this point, whereas elements 1-7 need to be shifted by the equal number of bytes. Similarly, element 8 needs not shift, whereas elements 9-15 need to be shifted by $Idx & 7 bytes. Of course, the shift amount I need to specify to SRD is in bits, hence the reason for the left shift of 3.

Here's the complete shift sequence for LE (bytes):

Elem    LShift in VSR    RShift in GPR (bytes)
0       8                0
1       8                1
2       8                2
3       8                3
4       8                4
5       8                5
6       8                6
7       8                7
8       0                0
9       0                1
10      0                2
11      0                3
12      0                4
13      0                5
14      0                6
15      0                7

I have worked out similar sequences for the halfwords and for both on BE.
Also, I have an execution test case that prints each element of each size vector both as constant element values and as variables. In my testing, the test case was compiled with clang and with gcc and the results compared with diff. I have run this on P8 systems, both LE and BE.

nemanjai added inline comments.Aug 18 2015, 4:32 PM
lib/Target/PowerPC/PPCInstrVSX.td
1309

This code:

int f(vector signed char c, int i) {
  return c[3];
}

Has the same extsb, which is then followed in this case by an extsw which is redundant, but understandable given the integral promotion.

wschmidt added inline comments.Aug 18 2015, 6:08 PM
lib/Target/PowerPC/PPCInstrVSX.td
1309

You are relying on context for your patterns to argue that they will always be correct. That is ok, but it is then incumbent on you to document this extremely well. "Clever tricks" should never be unaccompanied by diligent commentary. The fact is that if somebody were to use your "LE_BYTE_x" patterns in the (perfectly reasonable) belief that this would actually produce a right-adjusted zero- or sign-extended byte, they would be grossly disappointed. Outside of their contextual uses in the vector_extract patterns, these would not produce what their name argues they would produce. They would produce an i32 that is pretending to be an i8 but contains values not representable by an i8.

So go ahead and do this, but document the bejeebus out of it. Then if something goes wrong with the contextual assumptions in the future, the poor maintainer will have half a chance of sorting it out. When creating a building block, you have to be aware that others may want to use it in a way you didn't foresee.

I understand why you want to do this, but a more proper solution is to do it right and make sure that we have peephole optimizations later on that will remove redundant extend operations. We need those anyway. Therefore, go ahead and do what you're doing, but in addition to documenting it, add commentary indicating what ought to be done in the future.

Hal, please feel to weigh in...

1334

OK. Again, this needs to be better documented. Your commentary for the LShift portion is:
"What we do is set up the index by masking off bits we don't need and shifting accordingly." You're not actually masking off any bits for this case, which is what confused me. Based on that commentary, this didn't appear to fulfill your intent. I note that your subsequent explanation to me doesn't have any mention of masking off bits either.

So please document more clearly what is going on with these patterns. Too much commentary is never a crime...an expression like

dag BE_VBYTE_SHIFT = (EXTRACT_SUBREG (RLDICR (ANDC8 (LI8 7), $Idx), 3, 60),
                                     sub_32);

without its own line of commentary is kind of a travesty.

If you don't mind, I would like to wait for a more heavily documented version before fully reviewing the variable extracts.

wschmidt added inline comments.Aug 18 2015, 6:11 PM
lib/Target/PowerPC/PPCInstrVSX.td
1334

(I guess that converting to 0 or 8 is "masking off bits," but that really isn't very descriptive of what you're doing. You're just isolating the containing doubleword.)

wschmidt added inline comments.Aug 18 2015, 6:54 PM
lib/Target/PowerPC/PPCInstrVSX.td
1309

Actually, I'm going to stick to my guns on this one.

I'm not asking you to add another extend instruction. I'm asking you to use the correct form of RLDICL for the LE_BYTE_x and LE_HWORD_x cases. If you change the last parameter to RLDICL to always be 8, you will isolate a byte. Do this for the LE_BYTE_x forms. If you change it to always be 16, you will isolate a halfword. Do this for the LE_HWORD_X forms.

This does not cost any extra instructions, and the patterns will generate i32s that contain true i8s and i16s, respectively. Furthermore, this has an added benefit: We can later optimize away the sign- or zero-extend instruction that gets generated by the context. A peephole optimization can easily prove that a rldicl that has already isolated a byte does not need a subsequent extsb or second truncating rldicl. The same is true for isolation of a halfword.

On the other hand, if you leave things as you have them, we can never get rid of the extsb, because the optimizer doesn't know that you left nonzero bits in there because you "knew" there would be an extend coming along to fix things up.

hfinkel added inline comments.Aug 19 2015, 1:11 AM
lib/Target/PowerPC/PPCInstrVSX.td
1309

First, Nemanja appears to be right, it is legal to leave the higher-order bits arbitrarily defined. The code in DAGTypeLegalizer::PromoteIntOp_EXTRACT_VECTOR_ELT ends with:

// EXTRACT_VECTOR_ELT can return types which are wider than the incoming
// element types. If this is the case then we need to expand the outgoing
// value and not truncate it.
return DAG.getAnyExtOrTrunc(Ext, dl, N->getValueType(0));

Second, Bill is right. We should zero-out the higher-order bits if we can do so without any extra instructions. We already have a peephole optimization in PPCISelDAGToDAG.cpp to eliminate unnecessary i32->i64 extensions (see the PeepholePPC64ZExt() function), and it should be easy to extend it to catch this case as well. Also, it makes "manual" debugging easier to have the values zero extended (and making debugging easier at no additional cost is certainly a good thing).

And, for any pattern where you do leave some of the higher-order bits arbitrarily defined, please document that explicitly.

nemanjai added inline comments.Sep 21 2015, 2:12 AM
lib/Target/PowerPC/PPCInstrVSX.td
1309

OK, it appears that the source of my misunderstanding was that I thought there was no way to both shift right and clear high order bits with a single RLDICL instruction.
Of course, it is clearly preferable for a moved i8, i16 or i32 to have the high order bits cleared rather than arbitrarily set. I will change the last parameter to the RLDICL instructions to the moved witdth, test it and update the patch. Thank you both for your feedback.

nemanjai updated this revision to Diff 35253.Sep 21 2015, 6:53 AM

One major drawback identified by wschmidt with the previous patch was that the element being moved was just right justified in the GPR without clearing the high order bits. Basically, the only change is to use the version of the RLDICL instruction that will not only right justify the value, but will clear the other bits too.

Sorry, don't review this yet. Another update is coming very soon. I forgot to improve the documentation for the variable element number extract patterns. Stay tuned and sorry about this omission - I was too focused on the RLDICL patterns.

nemanjai updated this revision to Diff 35274.Sep 21 2015, 10:04 AM

Documented the really complex patterns that extract a variable element number from a vector.

wschmidt edited edge metadata.Sep 21 2015, 1:14 PM

Otherwise, this LGTM. Thanks for addressing all my concerns!

lib/Target/PowerPC/PPCISelLowering.cpp
558

You don't want to reintroduce this restriction, right?

test/CodeGen/PowerPC/p8-scalar_vector_conversions.ll
64

Same here -- don't disable these checks.

96

Do we have an issue open for the unnecessary extsb? We should peephole this away at some point. Please open one if there is not.

735

Similarly, the extsh should be peepholed away eventually.

Bill,
I've opened a work item to track the redundant rldicl instructions for unsigned versions of these. We can optimize those away at some point.
Since Hal's OK with this patch and if you're satisfied with the changes and the follow-up work item, please accept this patch and I'll commit it.

lib/Target/PowerPC/PPCISelLowering.cpp
558

That's just the existing code since the bootstrap failure fix is not committed yet.

test/CodeGen/PowerPC/p8-scalar_vector_conversions.ll
64

Same, that will be part of the commit for the bootstrap failure fix.

735

We need the sign extends as we've discussed on IRC.

wschmidt accepted this revision.Sep 29 2015, 11:06 AM
wschmidt edited edge metadata.

Go to it!

This revision is now accepted and ready to land.Sep 29 2015, 11:06 AM
nemanjai closed this revision.Oct 9 2015, 4:16 AM

Committed revision 249822.