This is an archive of the discontinued LLVM Phabricator instance.

Differential D34032

[Power9] Exploit vector extract with variable index
Closed, Public

Authored by syzaara on Jun 8 2017, 6:21 AM.

Details

Reviewers

kbarton
nemanjai
sfertile
lei
jtony
stefanp
echristo
hfinkel
inouehrs

Commits

rGaa5a6a1c307e: [Power9] Exploit vector extract with variable index.
rL307174: [Power9] Exploit vector extract with variable index.

Summary

This patch adds exploitation of the new Power9 instructions that extract a variably indexed element from a vector:
VEXTUBLX
VEXTUBRX
VEXTUHLX
VEXTUHRX
VEXTUWLX
VEXTUWRX
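
For orientation, here is a minimal Python sketch of what these instructions compute, as I read the Power ISA 3.0 description. The low four bits of a GPR give a byte offset into the vector register, counted from the left (L forms) or from the right (R forms), and the selected byte, halfword, or word is zero-extended into the result. The function name `vextu`, its parameters, and the modeling of the register as 16 bytes with byte 0 most significant are assumptions of this sketch, not part of the patch.

```python
def vextu(vrb: bytes, ra: int, width: int, from_left: bool) -> int:
    """Hypothetical model of VEXTU{B,H,W}{L,R}X (not LLVM or ISA source).

    vrb:       the 16 bytes of the vector register, byte 0 most significant
    ra:        GPR holding the byte offset (only the low 4 bits are used)
    width:     1 (byte), 2 (halfword) or 4 (word)
    from_left: True for the L forms, False for the R forms
    """
    if len(vrb) != 16 or width not in (1, 2, 4):
        raise ValueError("expected 16 register bytes and a width of 1, 2 or 4")
    offset = ra & 0xF
    if offset > 16 - width:
        # Assumption: this sketch rejects offsets that would run past the
        # register instead of modeling what the hardware returns.
        raise ValueError("offset runs past the end of the register")
    start = offset if from_left else 16 - width - offset
    return int.from_bytes(vrb[start:start + width], "big")


reg = bytes(range(16))               # byte i of the register holds the value i
print(hex(vextu(reg, 2, 2, True)))   # halfword at byte offset 2 from the left
print(hex(vextu(reg, 2, 2, False)))  # halfword at byte offset 2 from the right
```

Because the offset is in bytes, halfword and word extractions need the element index scaled by 2 or 4 first, which is what the MULLI8 operands in the patterns of this diff do.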

Event Timeline

syzaara created this revision. Jun 8 2017, 6:21 AM

syzaara retitled this revision from [Power9] Expoilt vector extract with variable index to [Power9] Exploit vector extract with variable index. Jun 8 2017, 6:25 AM

I suspect that the total latency of an LI, VEXTU[BH][LR]X for extracting constant elements is probably less than the current set up when a shift in the vector element is required. We should probably use these new instructions for such extractions as well.
When it comes to word extractions, I don't think it makes a difference, but halfword and byte ones are probably better off using the new instructions.
I'm fine with that being a separate patch, but we shouldn't forget it.

lib/Target/PowerPC/PPCInstrVSX.td
1909	Please don't use the multiply instruction when a shift is perfectly adequate (and likely much lower latency).
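
The equivalence behind this request is simple arithmetic: element sizes are powers of two, so scaling an index to a byte offset by multiplying is the same as shifting left by log2 of the size. A small Python check of that claim follows. The helper names are made up for illustration, and the sketch says nothing about actual instruction latencies.

```python
def offset_by_multiply(index: int, elem_bytes: int) -> int:
    """Byte offset computed by multiplication, as the MULLI8 patterns do."""
    return index * elem_bytes


def offset_by_shift(index: int, elem_bytes: int) -> int:
    """Same offset using a left shift; elem_bytes must be a power of two."""
    if elem_bytes <= 0 or elem_bytes & (elem_bytes - 1):
        raise ValueError("elem_bytes must be a power of two")
    return index << (elem_bytes.bit_length() - 1)


# Compare both forms for every valid halfword (8 lanes) and word (4 lanes) index.
for elem_bytes, lanes in ((2, 8), (4, 4)):
    for index in range(lanes):
        assert offset_by_multiply(index, elem_bytes) == offset_by_shift(index, elem_bytes)
```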

In D34032#776682, @nemanjai wrote:

I suspect that the total latency of an LI, VEXTU[BH][LR]X for extracting constant elements is probably less than the current set up when a shift in the vector element is required. We should probably use these new instructions for such extractions as well.
When it comes to word extractions, I don't think it makes a difference, but halfword and byte ones are probably better off using the new instructions.
I'm fine with that being a separate patch, but we shouldn't forget it.

The LI was already being added when using an immediate value for the index. I added a new testcase to cover this case.

Switched to shifts rather than multiplies, and added a test case showing the use of LI when the index is an immediate.

In D34032#778271, @syzaara wrote:

In D34032#776682, @nemanjai wrote:

I suspect that the total latency of an LI, VEXTU[BH][LR]X for extracting constant elements is probably less than the current set up when a shift in the vector element is required. We should probably use these new instructions for such extractions as well.
When it comes to word extractions, I don't think it makes a difference, but halfword and byte ones are probably better off using the new instructions.
I'm fine with that being a separate patch, but we shouldn't forget it.

The LI was already being added when using an immediate value for the index. I added a new testcase to cover this case.

This is understandable. Of course, if you were to just add a pattern for all the possible element indices (like the current patterns), then the LI that you get won't need to be shifted/multiplied. I think we should probably do that.

lib/Target/PowerPC/PPCInstrVSX.td
1909	I assumed this would be an `RLDICR` but this works just the same. In either case, I think the high-order bits should be cleared explicitly so the `RLWINM` should really have 1, 28, 30 as immediates (since the instruction takes its input in bits 60-63).
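
To make the immediates concrete: `rlwinm` rotates the low 32 bits of its source left by SH and keeps only bits MB through ME, where bit 0 is the most significant bit of the word. With SH=1, MB=28, ME=30 the result is (index << 1) masked with 0xE, so the halfword index is doubled and everything above the 4-bit offset field is cleared. Below is a Python sketch of that behavior for the MB <= ME case. The function names are mine, and wrap-around masks are deliberately left out.

```python
def mask32(mb: int, me: int) -> int:
    """32-bit mask with ones in IBM-numbered bits mb..me (bit 0 is the MSB)."""
    if not 0 <= mb <= me <= 31:
        raise ValueError("this sketch only handles 0 <= mb <= me <= 31")
    return (0xFFFFFFFF >> mb) & ((0xFFFFFFFF << (31 - me)) & 0xFFFFFFFF)


def rlwinm(rs: int, sh: int, mb: int, me: int) -> int:
    """Rotate the low 32 bits of rs left by sh, then AND with mask32(mb, me)."""
    word = rs & 0xFFFFFFFF
    sh &= 31
    rotated = ((word << sh) | (word >> (32 - sh))) & 0xFFFFFFFF
    return rotated & mask32(mb, me)


# Every halfword index 0..7 becomes the byte offset 2 * index.
for index in range(8):
    assert rlwinm(index, 1, 28, 30) == index * 2

# Stray high-order bits in the index register do not leak into the offset.
assert rlwinm(0xFFFFFFF3, 1, 28, 30) == 6
```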

Added patterns for each immediate element value so the LI doesn't need to be multiplied/shifted.
Cleared the upper bits with the correct mask for rlwinm.

stefanp added inline comments. Jun 20 2017, 1:03 PM

test/CodeGen/PowerPC/vec_extract_p9.ll
118	This is probably fine. I have more of a question here... I see that throughout the tests (not just here) the registers used are specified. While the register allocator is most likely to pick r3 in this case, is that a guarantee? The load immediate is not constrained by the ABI, so technically it could use a different register and the code would still be correct.

ping

LGTM.

test/CodeGen/PowerPC/vec_extract_p9.ll
118	The CHECK directives were produced by a script (see the first line in the test case). The choices the register allocator makes should be fairly consistent, so this should be fine.

This revision is now accepted and ready to land. Jun 29 2017, 1:18 AM

Closed by commit rL307174: [Power9] Exploit vector extract with variable index. (authored by jtony). Jul 5 2017, 9:55 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
lib/Target/PowerPC/PPCInstrVSX.td	22 lines
test/CodeGen/PowerPC/vec_extract_p9.ll	117 lines

Diff 101898

lib/Target/PowerPC/PPCInstrVSX.td

				// Variable index vector_extract for v2f64 does not require P8Vector
				let Predicates = [IsLittleEndian, HasVSX] in
				def : Pat<(f64 (vector_extract v2f64:$S, i64:$Idx)),
				(f64 VectorExtractions.LE_VARIABLE_DOUBLE)>;

				def : Pat<(v4i32 (int_ppc_vsx_lxvw4x_be xoaddr:$src)), (LXVW4X xoaddr:$src)>;
				def : Pat<(v2f64 (int_ppc_vsx_lxvd2x_be xoaddr:$src)), (LXVD2X xoaddr:$src)>;

				// Variable index unsigned vector_extract on Power9
				let Predicates = [HasP9Altivec, IsLittleEndian] in {
				def : Pat<(i64 (anyext (i32 (vector_extract v16i8:$S, i64:$Idx)))),
				(VEXTUBRX $Idx, $S)>;
				def : Pat<(i64 (anyext (i32 (vector_extract v8i16:$S, i64:$Idx)))),
				(VEXTUHRX (MULLI8 $Idx, 2), $S)>;
				def : Pat<(i64 (zext (i32 (vector_extract v4i32:$S, i64:$Idx)))),
				(VEXTUWRX (MULLI8 $Idx, 4), $S)>;
				def : Pat<(i64 (sext (i32 (vector_extract v4i32:$S, i64:$Idx)))),
				(EXTSW (VEXTUWRX (MULLI8 $Idx, 4), $S))>;
				}
				let Predicates = [HasP9Altivec, IsBigEndian] in {
				def : Pat<(i64 (anyext (i32 (vector_extract v16i8:$S, i64:$Idx)))),
				(VEXTUBLX $Idx, $S)>;
				def : Pat<(i64 (anyext (i32 (vector_extract v8i16:$S, i64:$Idx)))),
				(VEXTUHLX (MULLI8 $Idx, 2), $S)>;
				def : Pat<(i64 (zext (i32 (vector_extract v4i32:$S, i64:$Idx)))),
				(VEXTUWLX (MULLI8 $Idx, 4), $S)>;
				def : Pat<(i64 (sext (i32 (vector_extract v4i32:$S, i64:$Idx)))),
				(EXTSW (VEXTUWLX (MULLI8 $Idx, 4), $S))>;
				}

				let Predicates = [IsLittleEndian, HasDirectMove] in {
				// v16i8 scalar <-> vector conversions (LE)
				def : Pat<(v16i8 (scalar_to_vector i32:$A)),
				(v16i8 (COPY_TO_REGCLASS MovesToVSR.LE_WORD_0, VSRC))>;
				def : Pat<(v8i16 (scalar_to_vector i32:$A)),
				(v8i16 (COPY_TO_REGCLASS MovesToVSR.LE_WORD_0, VSRC))>;
				def : Pat<(v4i32 (scalar_to_vector i32:$A)),
				(v4i32 MovesToVSR.LE_WORD_0)>;

test/CodeGen/PowerPC/vec_extract_p9.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -verify-machineinstrs -mtriple=powerpc64le-unknown-gnu-linux -mcpu=pwr9 < %s | FileCheck %s -check-prefix=CHECK-LE
				; RUN: llc -verify-machineinstrs -mtriple=powerpc64-unknown-gnu-linux -mcpu=pwr9 < %s | FileCheck %s -check-prefix=CHECK-BE

				; Function Attrs: noinline norecurse nounwind readnone
				define zeroext i8 @test1(<16 x i8> %a, i32 signext %index) {
				; CHECK-LE-LABEL: test1:
				; CHECK-LE: # BB#0: # %entry
				; CHECK-LE-NEXT: vextubrx 3, 5, 2
				; CHECK-LE-NEXT: clrldi 3, 3, 56
				; CHECK-LE-NEXT: blr
				; CHECK-BE-LABEL: test1:
				; CHECK-BE: # BB#0: # %entry
				; CHECK-BE-NEXT: vextublx 3, 5, 2
				; CHECK-BE-NEXT: clrldi 3, 3, 56
				; CHECK-BE-NEXT: blr

				entry:
				%vecext = extractelement <16 x i8> %a, i32 %index
				ret i8 %vecext
				}

				; Function Attrs: noinline norecurse nounwind readnone
				define signext i8 @test2(<16 x i8> %a, i32 signext %index) {
				; CHECK-LE-LABEL: test2:
				; CHECK-LE: # BB#0: # %entry
				; CHECK-LE-NEXT: vextubrx 3, 5, 2
				; CHECK-LE-NEXT: extsb 3, 3
				; CHECK-LE-NEXT: blr
				; CHECK-BE-LABEL: test2:
				; CHECK-BE: # BB#0: # %entry
				; CHECK-BE-NEXT: vextublx 3, 5, 2
				; CHECK-BE-NEXT: extsb 3, 3
				; CHECK-BE-NEXT: blr

				entry:
				%vecext = extractelement <16 x i8> %a, i32 %index
				ret i8 %vecext
				}

				; Function Attrs: noinline norecurse nounwind readnone
				define zeroext i16 @test3(<8 x i16> %a, i32 signext %index) {
				; CHECK-LE-LABEL: test3:
				; CHECK-LE: # BB#0: # %entry
				; CHECK-LE-NEXT: mulli 3, 5, 2
				; CHECK-LE-NEXT: vextuhrx 3, 3, 2
				; CHECK-LE-NEXT: clrldi 3, 3, 48
				; CHECK-LE-NEXT: blr
				; CHECK-BE-LABEL: test3:
				; CHECK-BE: # BB#0: # %entry
				; CHECK-BE-NEXT: mulli 3, 5, 2
				; CHECK-BE-NEXT: vextuhlx 3, 3, 2
				; CHECK-BE-NEXT: clrldi 3, 3, 48
				; CHECK-BE-NEXT: blr

				entry:
				%vecext = extractelement <8 x i16> %a, i32 %index
				ret i16 %vecext
				}

				; Function Attrs: noinline norecurse nounwind readnone
				define signext i16 @test4(<8 x i16> %a, i32 signext %index) {
				; CHECK-LE-LABEL: test4:
				; CHECK-LE: # BB#0: # %entry
				; CHECK-LE-NEXT: mulli 3, 5, 2
				; CHECK-LE-NEXT: vextuhrx 3, 3, 2
				; CHECK-LE-NEXT: extsh 3, 3
				; CHECK-LE-NEXT: blr
				; CHECK-BE-LABEL: test4:
				; CHECK-BE: # BB#0: # %entry
				; CHECK-BE-NEXT: mulli 3, 5, 2
				; CHECK-BE-NEXT: vextuhlx 3, 3, 2
				; CHECK-BE-NEXT: extsh 3, 3
				; CHECK-BE-NEXT: blr

				entry:
				%vecext = extractelement <8 x i16> %a, i32 %index
				ret i16 %vecext
				}

				; Function Attrs: noinline norecurse nounwind readnone
				define zeroext i32 @test5(<4 x i32> %a, i32 signext %index) {
				; CHECK-LE-LABEL: test5:
				; CHECK-LE: # BB#0: # %entry
				; CHECK-LE-NEXT: mulli 3, 5, 4
				; CHECK-LE-NEXT: vextuwrx 3, 3, 2
				; CHECK-LE-NEXT: blr
				; CHECK-BE-LABEL: test5:
				; CHECK-BE: # BB#0: # %entry
				; CHECK-BE-NEXT: mulli 3, 5, 4
				; CHECK-BE-NEXT: vextuwlx 3, 3, 2
				; CHECK-BE-NEXT: blr

				entry:
				%vecext = extractelement <4 x i32> %a, i32 %index
				ret i32 %vecext
				}

				; Function Attrs: noinline norecurse nounwind readnone
				define signext i32 @test6(<4 x i32> %a, i32 signext %index) {
				; CHECK-LE-LABEL: test6:
				; CHECK-LE: # BB#0: # %entry
				; CHECK-LE-NEXT: mulli 3, 5, 4
				; CHECK-LE-NEXT: vextuwrx 3, 3, 2
				; CHECK-LE-NEXT: extsw 3, 3
				; CHECK-LE-NEXT: blr
				; CHECK-BE-LABEL: test6:
				; CHECK-BE: # BB#0: # %entry
				; CHECK-BE-NEXT: mulli 3, 5, 4
				; CHECK-BE-NEXT: vextuwlx 3, 3, 2
				; CHECK-BE-NEXT: extsw 3, 3
				; CHECK-BE-NEXT: blr

				entry:
				%vecext = extractelement <4 x i32> %a, i32 %index
				ret i32 %vecext
				}