Page MenuHomePhabricator

[SLP]Cost for a constant buildvector.
Needs ReviewPublic

Authored by ABataev on Jun 2 2022, 7:56 AM.

Details

Summary

Usually, constant buildvector results in a vector load from a
constant/data pool.

Diff Detail

Event Timeline

ABataev created this revision.Jun 2 2022, 7:56 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 2 2022, 7:56 AM
ABataev requested review of this revision.Jun 2 2022, 7:56 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 2 2022, 7:56 AM

Some unfortunate regressions

llvm/test/Transforms/SLPVectorizer/X86/pr46983.ll
146 ↗(On Diff #433753)
vdmitrie added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
5933

This estimate should be bit more complicated.
Here are the things that can additionally be considered:

  • for scalar floating point ops a constant operand is normally loaded from memory too.
  • if it is an operand of instruction that becomes immediate (like shift value) and is splat - cost is zero.
  • for a scalar integer op a constant operand is typically an immediate, so this estimate works in most cases but there is an exception: 64 bits operations on a 32bits target. That should be taken into account too.
ABataev updated this revision to Diff 434112.Jun 3 2022, 12:46 PM

Address comments.

ABataev added inline comments.Jun 3 2022, 12:51 PM
llvm/test/Transforms/SLPVectorizer/X86/pr46983.ll
146 ↗(On Diff #433753)

llvm-mca reports throughputs:
For scalar code - 5.5
AVX vector - 8.0
AVX2 vector - 5.0

https://godbolt.org/z/rEc74dxza

xbolva00 added inline comments.Jun 3 2022, 1:03 PM
llvm/test/Transforms/SLPVectorizer/X86/pr46983.ll
146 ↗(On Diff #433753)

Ah, right, these checks are for avx.

Test case for collection.

llvm/lib/Target/X86/X86TargetTransformInfo.cpp
3802

Note that it is free for splat shift value only.
https://godbolt.org/z/KMf6sW5n3

Test case for collection.

mca shows that these 2 instructions has the same cost, so it actually doers not matter. Probably worth to add some other instructions, which can load params directly from memory for x86

vdmitrie added inline comments.Jun 3 2022, 2:23 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
5937

Just wondering is that possible for UserTreeIndices to be empty here? AFAIU it can be for root only but constants do not seed vtree.
if alternate opcodes are for shl/shr but shift value is splat it is still can be immediate for both of them.

5956

drop it?

ABataev added inline comments.Jun 3 2022, 2:42 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
5937
  1. If constants are reduced values in reduction ops.
  2. That's why there is a TODO above.
5956

What do you mean?

vdmitrie added inline comments.Jun 3 2022, 2:47 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
5937

okay. Although I believe it is not SLP vectorizer job to do constant folding.

5956

Drop extra definition of ScalarCost.
Otherwise loop at line 5807 is updating variable from 5802, but it is not used.
LIne 5811 will subtract one defined at 5795.

ABataev added inline comments.Jun 3 2022, 2:50 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
5956

Ah, yes, sure.

vdmitrie added inline comments.Jun 3 2022, 3:14 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
5961

Isn't this interface already puts assumption that a constant is a legal immediate?
I was trying to explore this too and I found that it does not seem to cover correctly 32bit target specifically for 64bit operations. Ideally we should have interface that tells whether immediate is a legal imm operand for a target but I have not found anything like that.
One way to figure this (which I found -may be wrongful) is
when condition
DL->getTypeStoreSizeInBits(ScalarTy) > DL->getLargestLegalIntTypeSizeInBits()
is true we cannot assume operand as a legal immediate.

ABataev added inline comments.Jun 3 2022, 3:24 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
5961

I'll check it.

ABataev added inline comments.Jun 6 2022, 9:02 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
5937

Do you suggest to hide it in getConstBuildVectorInstrCost? And return the difference? Or just add a new member function?

vdmitrie added inline comments.Jun 6 2022, 11:34 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
5937

Alternate opcodes is SLP vectorizer specific. For that reason trying to sink that logic into inside the TTI interface does not look like right thing to do.
But outlining this whole new code into a separate member is a good idea.

What sounds weird for me is that constants may seed vtree for reduction. Although that is not directly related to this patch but you are placing here work arounds of that. IMO it is unpractical to run constants reduction through SLP vectorizer machinery.

Probably, to make the work around of that issue simpler in this patch, add an early return:
if (E->UserTreeIndices.empty())

return 0;

Otherwise it will be returning memory-op cost for a foldable operation.

ABataev updated this revision to Diff 435177.Jun 8 2022, 8:19 AM

Address comments

ABataev added inline comments.Jun 8 2022, 8:20 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
5937

What sounds weird for me is that constants may seed vtree for reduction.

InstCombiner and other passes are not always able to handle them (or require some extra work and compile time). E.g:

define i32 @foo(i32 %v, i32 %a) {
  %s1 = add i32 %v, 1
  %s2 = add i32 %a, 2
  %s3 = add i32 %s1, %s2
  %s11 = add i32 %v, %a
  %s31 = add i32 %s3, %s11
  %s4 = add i32 %v, 3
  %s5 = add i32 %a, 4
  %s6 = add i32 %s4, %s5
  %s7 = add i32 %s31, %s6
  ret i32 %s7
}

SLP is able to transform it to:

define i32 @foo(i32 %v, i32 %a) {
  %1 = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> <i32 4, i32 3, i32 2, i32 1>)
  %op.rdx = add i32 %a, %a
  %op.rdx1 = add i32 %a, %v
  %op.rdx2 = add i32 %v, %v
  %op.rdx3 = add i32 %op.rdx, %op.rdx1
  %op.rdx4 = add i32 %op.rdx3, %op.rdx2
  %op.rdx5 = add i32 %1, %op.rdx4
  ret i32 %op.rdx5
}

which can be optimized

%1 = i32 10

But I agree, that it requires improvement. We don't need to estimate the cost and emit reduction here. I have a patch that improves it. Need to work on it for some time, though.

RKSimon added inline comments.Jun 16 2022, 3:32 AM
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
3802

Yes, vector shifts must be splats without AVX2 or XOP

3822

Instruction::Add/Sub?

Also, we'd need to allow Idx ==0 || Idx == 1 for commutable ops.

ABataev added inline comments.Jun 16 2022, 10:17 AM
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
3822
  1. I excluded Add/Sub here because scalar Add/Sub with Imm has less cost than the vector Add/Subs (0.2-0.33 vs ~0.5)
  2. We can add it later, currently no such kind of analysis in getIntImmCostInst
ABataev updated this revision to Diff 437675.Jun 16 2022, 1:03 PM

Address comments

RKSimon added inline comments.Jun 22 2022, 10:10 AM
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
3815

XOP can memory fold from Idx == 0 as well.

ABataev updated this revision to Diff 439156.Jun 22 2022, 2:16 PM

Removed getConstantBuildVectorCost, the analysis for constant values already exists in getArithmeticInstrCost. Added support for const operand for stores in getMemoryOpCost function.

RKSimon added inline comments.Mon, Aug 15, 4:15 AM
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
4100

Just do this once before the !VTy || !LT.second.isVector()) check?

ABataev updated this revision to Diff 452648.Mon, Aug 15, 6:27 AM

Address comment