Page MenuHomePhabricator

[x86] Improve build_vector v8i16 codegen
AbandonedPublic

Authored by bruno on Jan 26 2015, 6:18 AM.

Details

Summary

Currently, even with SSE2 enabled, we're generating a series of movds and punpckl to insert elements in v8i16. The testcase in the patch is currently transformed to:

movd    %r8d, %xmm0
movd    24(%rsp), %xmm1
punpcklwd       %xmm1, %xmm0
movd    %edx, %xmm1
movd    8(%rsp), %xmm2
punpcklwd       %xmm2, %xmm1
punpcklwd       %xmm0, %xmm1
movd    %ecx, %xmm0
movd    16(%rsp), %xmm2
punpcklwd       %xmm2, %xmm0
movd    %r9d, %xmm2
movd    %esi, %xmm3
punpcklwd       %xmm2, %xmm3
punpcklwd       %xmm0, %xmm3
punpcklwd       %xmm1, %xmm3

Where it could be replaced by a series of pinsrw instructions, saving 8 instructions:

pinsrw  $0, %esi, %xmm0
pinsrw  $1, %edx, %xmm0
pinsrw  $2, %ecx, %xmm0
pinsrw  $3, %r8d, %xmm0
pinsrw  $4, %r9d, %xmm0
pinsrw  $5, 8(%rsp), %xmm0
pinsrw  $6, 16(%rsp), %xmm0
pinsrw  $7, 24(%rsp), %xmm0

This patch adds this change while it also looks for an opportunity where we could transform this into a SHUFFLE+VEC_INSERT_ELTS first. Most part of this patch is about moving some functions around, that will come in a separated commit.

Diff Detail

Event Timeline

bruno updated this revision to Diff 18756.Jan 26 2015, 6:18 AM
bruno retitled this revision from to [x86] Improve build_vector v8i16 codegen.
bruno updated this object.
bruno edited the test plan for this revision. (Show Details)
bruno added reviewers: nadav, spatel, anemet, chandlerc.
bruno added a subscriber: Unknown Object (MLST).

Hi Bruno,

I am not sure this is the right thing to do.
Do you see any performance improvement with the new sequence?

My concern here is, with the new sequence, we have a complete linear sequence of instructions whereas the old sequence can be partly parallelized. Running both the new and old sequence through IACA, I see the following throughputs:
Old:

  • Sandy Bridge: 6.15 cycles.
  • Ivy Bridge: 6.15 cycles.
  • Haswell: 12 cycles.

New:

  • Sandy Bridge: 13 cycles.
  • Ivy Bridge: 13 cycles.
  • Haswell: 13 cycles.

This seems to concur my hypothesis.

Thanks,
-Quentin

bruno added a comment.Jan 26 2015, 9:56 AM

Thanks for the detailed measurements Quentin.

I've received a report about a performance degradation regarding this code against hand-written pinsrw assembly and trusted the source =(
I'll try it out myself and get back if it really shows up some performance improvements.

  • Original Message -----

From: "Bruno Cardoso Lopes" <bruno.cardoso@gmail.com>
To: "bruno cardoso" <bruno.cardoso@gmail.com>, nrotem@apple.com, spatel@rotateright.com, anemet@apple.com,
chandlerc@gmail.com
Cc: llvm-commits@cs.uiuc.edu
Sent: Monday, January 26, 2015 11:56:27 AM
Subject: Re: [PATCH] [x86] Improve build_vector v8i16 codegen

Thanks for the detailed measurements Quentin.

I've received a report about a performance degradation regarding this
code against hand-written pinsrw assembly and trusted the source =(
I'll try it out myself and get back if it really shows up some
performance improvements.

I'm not particularly familiar with the problem or the solution space, so take this with a grain of salt...

I've seen this kind of thing vary in utility depending on whether the instruction sequence in question is on the critical path or not. If it is on the critical path, then having a longer sequence with more ILP is normally a win. Otherwise, the shorter sequence is normally a win.

I'll also point out that we have the lib/CodeGen/MachineCombiner.cpp pass, that is specifically designed to do this kind of sequence substitution only in cases where we don't increase the critical path length. If my hypothesis is right, this would be a good potential use of that pass (which is theoretically target independent, although currently used only by AArch64).

-Hal

http://reviews.llvm.org/D7177

EMAIL PREFERENCES

http://reviews.llvm.org/settings/panel/emailpreferences/

llvm-commits mailing list
llvm-commits@cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits

Hi Hal,

I wasn't aware of that pass, thanks for the heads up! :-)

chandlerc requested changes to this revision.Mar 29 2015, 2:30 PM
chandlerc edited edge metadata.

Just marking that if this is still useful, it will need changes to at least avoid the severe performance regression Quentin pointed out.

This revision now requires changes to proceed.Mar 29 2015, 2:30 PM
bruno abandoned this revision.Apr 13 2017, 11:49 AM

I ran this through IACA a few years ago, and didn't prove to be any better.

Current:

  • Sandy Bridge: 6.15 cycles.
  • Ivy Bridge: 6.15 cycles.
  • Haswell: 12 cycles.

New:

  • Sandy Bridge: 13 cycles.
  • Ivy Bridge: 13 cycles.
  • Haswell: 13 cycles.

Abandoning.