This is an archive of the discontinued LLVM Phabricator instance.

Adjust the cost of vectorized SHL/SRL/SRA
Needs Review · Public

Authored by wmi on May 21 2015, 4:15 PM.

Details

Summary

This patch solves the problem in https://llvm.org/bugs/show_bug.cgi?id=23582. It adjusts the cost of vectorized SHL/SRL/SRA and makes sure they are lowered to vectorized shift instructions.

There are a bunch of testcases that need to be adjusted. I am sending out the patch first to see whether the approach is OK in general, and will update the patch with adjusted tests.

Diff Detail

Repository
rL LLVM

Event Timeline

wmi updated this revision to Diff 26282. May 21 2015, 4:15 PM
wmi retitled this revision to Adjust the cost of vectorized SHL/SRL/SRA.
wmi updated this object.
wmi edited the test plan for this revision.
wmi added reviewers: nadav, aschwaighofer.
wmi set the repository for this revision to rL LLVM.
wmi added subscribers: Unknown Object (MLST), davidxl.
nadav edited edge metadata. May 21 2015, 4:18 PM

Looks good to me assuming that you write the cost model tests, and that you write tests for the ISel changes.

Thanks for looking at this - comments below. I should mention I have some similar work in progress in D9474 and D9645 that is trying to efficiently vectorize more integer shifts.

lib/Target/X86/X86ISelLowering.cpp
16421 (On Diff #26282)

I think this is out of date - Elena did some refactoring on this recently.

lib/Target/X86/X86TargetTransformInfo.cpp
274

Why is v4i64 declared here?

284

SSE only has fast variable shift support for uniform values; these cost values should surely reflect the likely cost of a general (per-lane variable) shift.

It would be better to update the TargetTransformInfo::OK_UniformConstantValue tables + code above this to support TargetTransformInfo::OK_UniformValue as well.
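
To make this concrete, below is a minimal standalone sketch of the idea, not the actual X86TargetTransformInfo code: a separate cost table for shifts whose amount is uniform across all lanes, consulted only when the operand-kind information says the shift amount is uniform. The enum values, the table name and the lookup helper are hypothetical.

#include <optional>

// Hypothetical operand-kind and type enums; the real ones live in
// TargetTransformInfo and MVT.
enum class OperandKind { AnyValue, UniformValue, UniformConstantValue };
enum class ShiftISD { SHL, SRL, SRA };
enum class SimpleVT { v16i8, v8i16, v4i32, v2i64 };

struct ShiftCostEntry {
  ShiftISD Op;
  SimpleVT Ty;
  unsigned Cost;
};

// Assumed costs for shifts whose amount is a uniform but non-constant value;
// SSE2 handles these with the register forms of PSLL/PSRL/PSRA. There is no
// PSRAQ, so the v2i64 SRA entry is intentionally missing.
static const ShiftCostEntry SSE2UniformShiftCostTable[] = {
    {ShiftISD::SHL, SimpleVT::v8i16, 1}, {ShiftISD::SHL, SimpleVT::v4i32, 1},
    {ShiftISD::SHL, SimpleVT::v2i64, 1}, {ShiftISD::SRL, SimpleVT::v8i16, 1},
    {ShiftISD::SRL, SimpleVT::v4i32, 1}, {ShiftISD::SRL, SimpleVT::v2i64, 1},
    {ShiftISD::SRA, SimpleVT::v8i16, 1}, {ShiftISD::SRA, SimpleVT::v4i32, 1},
};

static std::optional<unsigned> lookupUniformShiftCost(ShiftISD Op, SimpleVT Ty) {
  for (const ShiftCostEntry &E : SSE2UniformShiftCostTable)
    if (E.Op == Op && E.Ty == Ty)
      return E.Cost;
  return std::nullopt;
}

unsigned getShiftCost(ShiftISD Op, SimpleVT Ty, OperandKind ShiftAmtKind,
                      unsigned GenericCost) {
  // The cheap path applies only when every lane is shifted by the same amount.
  if (ShiftAmtKind == OperandKind::UniformValue ||
      ShiftAmtKind == OperandKind::UniformConstantValue)
    if (auto Cost = lookupUniformShiftCost(Op, Ty))
      return *Cost;
  // Otherwise fall back to the generic (typically scalarized) estimate.
  return GenericCost;
}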

aschwaighofer edited edge metadata. May 22 2015, 7:13 AM

I share Simon's concerns. Please make sure that we still get a good estimate for kernels like the following (these are from the rdar mentioned in the commit).

#define TYPE char
#define OP >>
#define SIZE 1024
#define TYPE_ALIGN __attribute__((aligned(16)))

TYPE A1[SIZE] TYPE_ALIGN;
TYPE B1[SIZE] TYPE_ALIGN;
TYPE C1[SIZE] TYPE_ALIGN;

void kernel1() {
  for (int i = 0; i < SIZE; ++i) {
    A1[i] = B1[i] OP C1[i];
  }
}

or:

for(k=0, r=0; k<pos; k++)
  r += (MAX_UNSIGNED) 1 << k;
wmi added a comment. May 22 2015, 10:03 AM

I share Simon's concerns. Please make sure that we still get a good estimate for kernels like the following (these are from the rdar mentioned in the commit).

#define TYPE char
#define OP >>
#define SIZE 1024
#define TYPE_ALIGN __attribute__((aligned(16)))

TYPE A1[SIZE] TYPE_ALIGN;
TYPE B1[SIZE] TYPE_ALIGN;
TYPE C1[SIZE] TYPE_ALIGN;

void kernel1() {
  for (int i = 0; i < SIZE; ++i) {
    A1[i] = B1[i] OP C1[i];
  }
}

or:

for(k=0, r=0; k<pos; k++)
  r += (MAX_UNSIGNED) 1 << k;

Thanks for sharing the testcase. For the first testcase:

Without the patch, the generated code for the kernel loop is:
.LBB0_1: # %for.body
         # =>This Inner Loop Header: Depth=1
movsbl  B1+1024(%rax), %edx
movb    C1+1024(%rax), %cl
sarl    %cl, %edx
movb    %dl, A1+1024(%rax)
incq    %rax
jne     .LBB0_1

With the patch, the generated code for the kernel loop is:
.LBB0_1: # %vector.body
         # =>This Inner Loop Header: Depth=1
movd    B1+1024(%rax), %xmm1    # xmm1 = mem[0],zero,zero,zero
punpcklbw       %xmm1, %xmm1    # xmm1 = xmm1[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
punpcklwd       %xmm1, %xmm1    # xmm1 = xmm1[0,0,1,1,2,2,3,3]
psrad   $24, %xmm1
movd    C1+1024(%rax), %xmm2    # xmm2 = mem[0],zero,zero,zero
punpcklbw       %xmm2, %xmm2    # xmm2 = xmm2[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
punpcklwd       %xmm2, %xmm2    # xmm2 = xmm2[0,0,1,1,2,2,3,3]
psrad   $24, %xmm2
psrad   %xmm2, %xmm1
pand    %xmm0, %xmm1
packuswb        %xmm1, %xmm1
packuswb        %xmm1, %xmm1
movd    %xmm1, A1+1024(%rax)
addq    $4, %rax
jne     .LBB0_1

The vectorized version is slightly better than the scalarized version, but the cost estimation used to choose the VF is not very good: it reports a cost of 8 for VF==1 and a cost of 2 for VF==4. The estimated costs of the vectorized sext and trunc are too low and don't match the real costs.

Another problem is that the vectorizer doesn't know the char->int type promotion here is unnecessary.

Can you give me the whole version of the second testcase? I am not sure my tweaked version is the right one.

wmi added a comment. May 22 2015, 10:18 AM

Sorry, the formatting of the assembly code was lost when I replied from Phabricator. Repasting it here:

#define TYPE char
#define OP >>
#define SIZE 1024
#define TYPE_ALIGN __attribute__((aligned(16)))

TYPE A1[SIZE] TYPE_ALIGN;
TYPE B1[SIZE] TYPE_ALIGN;
TYPE C1[SIZE] TYPE_ALIGN;

void kernel1() {
  for (int i = 0; i < SIZE; ++i) {
    A1[i] = B1[i] OP C1[i];
  }
}

Without the patch, the kernel loop:
.LBB0_1: # %for.body

movsbl  B1+1024(%rax), %edx
movb    C1+1024(%rax), %cl
sarl    %cl, %edx
movb    %dl, A1+1024(%rax)
incq    %rax
jne     .LBB0_1

With the patch, the kernel loop:
.LBB0_1: # %vector.body

movd    B1+1024(%rax), %xmm1    # xmm1 = mem[0],zero,zero,zero
punpcklbw       %xmm1, %xmm1    # xmm1 = xmm1[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
punpcklwd       %xmm1, %xmm1    # xmm1 = xmm1[0,0,1,1,2,2,3,3]
psrad   $24, %xmm1
movd    C1+1024(%rax), %xmm2    # xmm2 = mem[0],zero,zero,zero
punpcklbw       %xmm2, %xmm2    # xmm2 = xmm2[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
punpcklwd       %xmm2, %xmm2    # xmm2 = xmm2[0,0,1,1,2,2,3,3]
psrad   $24, %xmm2
psrad   %xmm2, %xmm1
pand    %xmm0, %xmm1
packuswb        %xmm1, %xmm1
packuswb        %xmm1, %xmm1
movd    %xmm1, A1+1024(%rax)
addq    $4, %rax
jne     .LBB0_1

Although the vectorized version is slightly better, the cost estimation is not precise (the vectorizer's cost estimation says VF==4 (cost is 2) is much better than VF==1 (cost is 8)).

Wei.

wmi added a comment. May 22 2015, 11:25 AM

Ah, the code generated by this patch is incorrect. As Simon said, SSE
only has fast variable shift support for uniform values. I
misunderstood the meaning of the vectorized shift instruction. Will
rewrite the patch.

wmi updated this revision to Diff 26536. May 26 2015, 2:19 PM
wmi edited edge metadata.

I updated the patch.

The cost of a vectorized shift with a uniform scalar shift amount is adjusted. lowerShift already contains the logic to lower such shifts to ISD::VSHL/VSRL/VSRA properly.
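
As a concrete illustration of the kind of loop this targets, here is a hypothetical kernel (not from the patch or the rdar), similar to the ones above but with a single non-constant shift amount shared by all iterations:

#define SIZE 1024

int A2[SIZE], B2[SIZE];

// Hypothetical example: every iteration shifts by the same non-constant
// amount 's', so after vectorization the whole vector can be shifted by one
// PSRAD with the amount held in an XMM register, instead of scalarizing the
// shift per lane.
void kernel2(int s) {
  for (int i = 0; i < SIZE; ++i)
    A2[i] = B2[i] >> s;
}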

It works for the motivating case in PR23582, and the LLVM unit tests pass.

Thank you for the update. I don't know enough about the LoopVectorize code to review that part, but the TTI cost model changes look correct.

Please can you add uniform non-constant tests to:

test/Analysis/CostModel/X86/testshiftashr.ll
test/Analysis/CostModel/X86/testshiftlshr.ll
test/Analysis/CostModel/X86/testshiftshl.ll

wmi updated this revision to Diff 26983. Jun 2 2015, 10:10 AM

Please can you add uniform non-constant tests to:
test/Analysis/CostModel/X86/testshiftashr.ll
test/Analysis/CostModel/X86/testshiftlshr.ll
test/Analysis/CostModel/X86/testshiftshl.ll

Thanks for the review. I added uniform non-constant tests to those files. Another change in this patch is to try to set the operand kind to TargetTransformInfo::OK_UniformValue in the CostModelAnalysis pass.
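
For illustration only, here is a standalone sketch of that classification idea rather than the actual CostModelAnalysis code; the simplified structures and names are hypothetical. The point is that a shift amount can be treated as uniform when every lane is known to hold the same value, even if that value is not a compile-time constant:

#include <vector>

// Hypothetical operand-kind enum mirroring the TTI operand-kind idea.
enum class OperandKind { AnyValue, UniformValue, UniformConstantValue };

// Simplified description of one lane of the shift-amount operand.
struct Lane {
  bool IsConstant;   // lane value known at compile time
  long long Value;   // meaningful only when IsConstant is true
  int SourceId;      // identifies the scalar value feeding this lane
};

OperandKind classifyShiftAmount(const std::vector<Lane> &Lanes) {
  if (Lanes.empty())
    return OperandKind::AnyValue;

  bool AllSameScalar = true;
  bool AllSameConstant = Lanes[0].IsConstant;
  for (const Lane &L : Lanes) {
    AllSameScalar = AllSameScalar && (L.SourceId == Lanes[0].SourceId);
    AllSameConstant = AllSameConstant && L.IsConstant &&
                      (L.Value == Lanes[0].Value);
  }

  if (AllSameConstant)
    return OperandKind::UniformConstantValue; // splat of a single constant
  if (AllSameScalar)
    return OperandKind::UniformValue;         // splat of one non-constant value
  return OperandKind::AnyValue;
}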