This patch adds exploitation of the new Power9 instructions that extract variably indexed elements from vectors:
VEXTUBLX
VEXTUBRX
VEXTUHLX
VEXTUHRX
VEXTUWLX
VEXTUWRX
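For readers unfamiliar with these instructions, the following is a minimal Python model of what they compute. It is a sketch based on one reading of the Power ISA 3.0 descriptions, not the patch's code. It assumes the vector register is given as 16 bytes in left-to-right (big-endian) order and that the index is taken from the low 4 bits of the GPR. The function names `vextu_lx` and `vextu_rx` are hypothetical, and out-of-range indices (undefined in hardware) are modelled as errors.

```python
def vextu_lx(ra: int, vrb: bytes, width: int) -> int:
    """Model of VEXTU[BHW]LX: extract `width` bytes starting at byte
    `index`, counted from the left end of the vector."""
    assert len(vrb) == 16 and width in (1, 2, 4)
    index = ra & 0xF  # only the low 4 bits of the GPR are used
    if index + width > 16:
        raise ValueError("result is undefined for this index")
    # The extracted element is zero-extended into the result register.
    return int.from_bytes(vrb[index:index + width], "big")


def vextu_rx(ra: int, vrb: bytes, width: int) -> int:
    """Model of VEXTU[BHW]RX: same as above, but `index` is counted
    from the right end of the vector."""
    assert len(vrb) == 16 and width in (1, 2, 4)
    index = ra & 0xF
    start = 16 - width - index
    if start < 0:
        raise ValueError("result is undefined for this index")
    return int.from_bytes(vrb[start:start + width], "big")


vec = bytes(range(16))            # bytes 0x00, 0x01, ..., 0x0F
print(hex(vextu_lx(3, vec, 1)))   # byte 3 from the left
print(hex(vextu_rx(3, vec, 1)))   # byte 3 from the right
print(hex(vextu_lx(2, vec, 2)))   # halfword at bytes 2..3
```

Note that in this model the index is a byte offset, not an element number, which is why the halfword and word forms need the index scaled by the element size.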
I suspect that the total latency of an LI followed by a VEXTU[BH][LR]X for extracting constant elements is probably lower than that of the current setup when a shift of the vector element is required. We should probably use these new instructions for such extractions as well.
When it comes to word extractions, I don't think it makes a difference, but halfword and byte ones are probably better off using the new instructions.
I'm fine with that being a separate patch, but we shouldn't forget it.
| lib/Target/PowerPC/PPCInstrVSX.td | |
|---|---|
| 1909 | Please don't use the multiply instruction when a shift is perfectly adequate (and likely much lower latency). |
The LI was already being added when using an immediate value for the index. I added a new test case to cover this.
Switched to shifts rather than multiplies, and added a test case to show the use of LI when the index is an immediate.
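To illustrate the point about shifts, here is a small Python sketch of the index scaling involved. It assumes the instructions expect a byte offset, so an element index has to be scaled by the element size in bytes; because the sizes are powers of two, a left shift gives the same result as a multiply. The helper names are hypothetical and are not taken from the patch.

```python
def byte_offset_mul(elem_index: int, elem_size: int) -> int:
    """Scale an element index to a byte offset with a multiply."""
    return elem_index * elem_size


def byte_offset_shift(elem_index: int, elem_size: int) -> int:
    """Same scaling with a left shift; valid because elem_size is a
    power of two (1, 2 or 4 bytes)."""
    assert elem_size in (1, 2, 4)
    shift = elem_size.bit_length() - 1  # log2 of the element size
    return elem_index << shift


# Check that the two forms agree for every valid element index.
for size in (1, 2, 4):
    for idx in range(16 // size):
        assert byte_offset_mul(idx, size) == byte_offset_shift(idx, size)
```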
This is understandable. Of course, if you were to just add a pattern for all the possible element indices (like the current patterns), then the LI that you get won't need to be shifted/multiplied. I think we should probably do that.
| lib/Target/PowerPC/PPCInstrVSX.td | |
|---|---|
| 1909 | I assumed this would be an RLDICR, but this works just the same. In either case, I think the high-order bits should be cleared explicitly, so the RLWINM should really have 1, 28, 30 as immediates (since the instruction takes its input in bits 60-63). |
- Added patterns for each immediate element index so the LI value doesn't need to be multiplied/shifted.
- Cleared the upper bits with the correct mask for rlwinm.
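The effect of the suggested RLWINM immediates (1, 28, 30) can be sketched in Python as follows. This is a simplified 32-bit model of rotate-left-then-AND-with-mask, written from the general definition of the instruction and using PowerPC bit numbering (bit 0 is the most significant bit of the 32-bit word). The function names `mask32` and `rlwinm` are illustrative only. It shows that a halfword index is doubled into a byte offset while everything outside the low 4 bits is cleared.

```python
def mask32(mb: int, me: int) -> int:
    """Mask with ones in bits mb..me, where bit 0 is the MSB and
    bit 31 the LSB of a 32-bit word."""
    head = 0xFFFFFFFF >> mb                      # ones from bit mb to bit 31
    tail = (0xFFFFFFFF << (31 - me)) & 0xFFFFFFFF  # ones from bit 0 to bit me
    # A wrapped mask (mb > me) is the union rather than the intersection.
    return (head & tail) if mb <= me else (head | tail)


def rlwinm(rs: int, sh: int, mb: int, me: int) -> int:
    """Rotate the low 32 bits of rs left by sh, then AND with the mask."""
    word = rs & 0xFFFFFFFF
    rotated = ((word << sh) | (word >> (32 - sh))) & 0xFFFFFFFF
    return rotated & mask32(mb, me)


print(hex(mask32(28, 30)))                   # mask selected by MB=28, ME=30
print(rlwinm(3, 1, 28, 30))                  # halfword index 3
print(rlwinm(0xFFFFFFF3, 1, 28, 30))         # same index with junk in the high bits
```

With these immediates, any junk in the upper bits of the index register is masked off, so the result always lies in the range the extract instruction reads.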
| test/CodeGen/PowerPC/vec_extract_p9.ll | |
|---|---|
| 118 | This is probably fine. I have more of a question here... |
LGTM.
| test/CodeGen/PowerPC/vec_extract_p9.ll | |
|---|---|
| 118 | The CHECK directives were produced by a script (see the first line in the test case). The choices the register allocator makes should be fairly consistent, so this should be fine. |