This is an archive of the discontinued LLVM Phabricator instance.

[Power9] Exploit vector extract with variable index
ClosedPublic

Authored by syzaara on Jun 8 2017, 6:21 AM.

Details

Summary

This patch adds the exploitation for new power 9 instructions which extract variable elements from vectors:
VEXTUBLX
VEXTUBRX
VEXTUHLX
VEXTUHRX
VEXTUWLX
VEXTUWRX

Diff Detail

Repository
rL LLVM

Event Timeline

syzaara created this revision.Jun 8 2017, 6:21 AM
syzaara retitled this revision from [Power9] Expoilt vector extract with variable index to [Power9] Exploit vector extract with variable index.Jun 8 2017, 6:25 AM
nemanjai edited edge metadata.Jun 9 2017, 12:03 AM

I suspect that the total latency of an LI, VEXTU[BH][LR]X for extracting constant elements is probably less than the current set up when a shift in the vector element is required. We should probably use these new instructions for such extractions as well.
When it comes to word extractions, I don't think it makes a difference, but halfword and byte ones are probably better off using the new instructions.
I'm fine with that being a separate patch, but we shouldn't forget it.

lib/Target/PowerPC/PPCInstrVSX.td
1909 ↗(On Diff #101898)

Please don't use the multiply instruction when a shift is perfectly adequate (and likely much lower latency).

I suspect that the total latency of an LI, VEXTU[BH][LR]X for extracting constant elements is probably less than the current set up when a shift in the vector element is required. We should probably use these new instructions for such extractions as well.
When it comes to word extractions, I don't think it makes a difference, but halfword and byte ones are probably better off using the new instructions.
I'm fine with that being a separate patch, but we shouldn't forget it.

The LI was already being added when using an immediate value for the index. I added a new testcase to cover this case.

syzaara updated this revision to Diff 102228.Jun 12 2017, 12:35 PM

Using shifts rather than multiplies and added a test case to show use of LI when index is an immediate.

I suspect that the total latency of an LI, VEXTU[BH][LR]X for extracting constant elements is probably less than the current set up when a shift in the vector element is required. We should probably use these new instructions for such extractions as well.
When it comes to word extractions, I don't think it makes a difference, but halfword and byte ones are probably better off using the new instructions.
I'm fine with that being a separate patch, but we shouldn't forget it.

The LI was already being added when using an immediate value for the index. I added a new testcase to cover this case.

This is understandable. Of course, if you were to just add a pattern for all the possible element indices (like the current patterns), then the LI that you get won't need to be shifted/multiplied. I think we should probably do that.

lib/Target/PowerPC/PPCInstrVSX.td
1909 ↗(On Diff #102228)

I assumed this would be an RLDICR but this works just the same. In either case, I think the high-order bits should be cleared explicitly so the RLWINM should really have 1, 28, 30 as immediates (since the instruction takes its input in bits 60-63).

syzaara updated this revision to Diff 102365.Jun 13 2017, 10:29 AM
  1. Added patterns for each immediate element value so the LI doesn't need to be multiplied/shifted.
  2. Clear the upper bits with the correct mask for rlwinm
stefanp added inline comments.Jun 20 2017, 1:03 PM
test/CodeGen/PowerPC/vec_extract_p9.ll
117 ↗(On Diff #102365)

This is probably fine. I have more of a question here...
I see that throughout the tests (not just here) the registers used are specified.
While the register allocator is most likely to pick r3 in this case is that a guarantee? The load immediate is not constrained by the ABI so technically it could use a different register and this is still correct.

nemanjai accepted this revision.Jun 29 2017, 1:18 AM

LGTM.

test/CodeGen/PowerPC/vec_extract_p9.ll
117 ↗(On Diff #102365)

The CHECK directives were produced by a script (see the first line in the test case). The choices register allocator makes should be fairly consistent so this should be fine.

This revision is now accepted and ready to land.Jun 29 2017, 1:18 AM
This revision was automatically updated to reflect the committed changes.