When legalizing vector operations on vNi128, they will be split to v1i128
because that is a legal type on ppc64, but then the compiler will crash in
selection dag because it fails to select for these operations. This patch fixes
shift operations. Logical shift right and left shift can be performed in the
vector unit, but algebraic shift right requires being split. We only perform
them in the vector unit for Power 9, as it is too costly on other platforms.
Details
Diff Detail
Event Timeline
Note that we should eventually produce the power9 sequence for the non-vector case as well.
lib/Target/PowerPC/PPCInstrAltivec.td | ||
---|---|---|
991 | I think these patterns are missing something. I understand that the shift amount is split up between the VSLO and VSL (i.e. shifting by 9 is meant to shift $vA by one byte and one bit). However, for the VSL, every byte in $vB needs to have the same value in the low order 3 bits or the result is undefined. So I think we need to VSPLTB 15 the $vB. |
test/CodeGen/PowerPC/shift128.ll | ||
---|---|---|
4 | I don't recognize this triple. We typically use powerpc64le-unknown-linux-gnu. |
test/CodeGen/PowerPC/shift128.ll | ||
---|---|---|
4 | Yeah, the vendor part is a nop here, but should be changed. |
Change vendor to unknown
lib/Target/PowerPC/PPCInstrAltivec.td | ||
---|---|---|
991 | The documentation says that for VSL it just uses bits 125:127 from VRB. |
lib/Target/PowerPC/PPCInstrAltivec.td | ||
---|---|---|
991 | From Power ISA 3.0: The result is place into VR[VRT], except if, for any byte And there have apparently been cores in the past (I don't know which ones) that would not produce the correct results if the bottom 3 bits were not the same in all the byte elements. So I'm afraid we'll need this VSPLTB. |
lib/Target/PowerPC/PPCInstrAltivec.td | ||
---|---|---|
991 | OK. But we only want to generate this sequence on Power 9 anyway. Is this well defined on power9? |
The unfortunate situation is that the ISA provides no guarantee that the implementation will do the right thing in this situation. With the most up-to-date ISA, we unfortunately still have to make this a 3 instruction sequence.
However, what's also missing here is the ability to generate better code for "shift immediate".
For reg/reg shifts, we need
(vsl (vslo %a, %b), (vspltb %b, 15)) - could also reduce reg pressure (at the cost of longer path length for the vslo) with (vsl (vslo %a, (vspltb %b, 15)), (vspltb %b, 15))
for reg/imm shifts, we need
(vsl (vslo %a, (xxspltib <imm>)), (xxspltib <imm>))
It may appear that the immediate form is no better than the register form since we're using the same number of instructions. However, if we're materializing a constant shift amount, it will be a load-immediate followed by a direct-move. So this is a significantly better overall sequence.
I think this is fine now (other than a couple of inline nits). LGTM but I'll let @kbarton have a look as well rather than approving the patch. The output patterns could use (VSPLTB 15, %b) in the VSLO portion as well so that it can go in the same register, but I'll leave the choice to do so or not up to you.
We should also probably have a follow-up patch for immediate shifts.
lib/Target/PowerPC/PPCISelLowering.cpp | ||
---|---|---|
687 | s/2/3 (i.e. we need 3 instructions to do the shift). Same below. |
lib/Target/PowerPC/PPCISelLowering.cpp | ||
---|---|---|
687 | Thanks. |
s/2/3 (i.e. we need 3 instructions to do the shift). Same below.