This is an archive of the discontinued LLVM Phabricator instance.

Changes in conversion cost model for X86 target
AbandonedPublic

Authored by delena on Dec 16 2015, 11:59 PM.

Details

Summary

The current cost calculation for the conversion instructions has many problems

  • The provided numbers are inaccurate
  • Huge numbers are given for vector split

I changed the approach for vector cost calculation:

  • If the original types are simple - check them first
  • if the original vector should be split - take the legal types and multiply by split factor
  • split factor is the max factor between source and destination
  • do not put v8i32 -> v4i64 cost. The calculated cost should be 2 * (v4i32 -> v4i64).

I also checked all SSE numbers and put the real number of instructions instead.
I understand, that instruction latency is different in SSE2 and AVX, but
the matter of this cost model to let the vectorizer to choose the right VF (compare VFx to VFy for the same target).

Diff Detail

Repository
rL LLVM

Event Timeline

delena updated this revision to Diff 43102.Dec 16 2015, 11:59 PM
delena retitled this revision from to Changes in conversion cost model for X86 target.
delena updated this object.
delena added reviewers: congh, hfinkel, andreadb, nadav.
delena set the repository for this revision to rL LLVM.
delena added a subscriber: llvm-commits.
congh edited edge metadata.Dec 17 2015, 4:08 PM

This patch overall looks very nice! Please check my inline comments.

../lib/Target/X86/X86TargetTransformInfo.cpp
837

Those two statements can be merge into one:

std::tie(SrcSplitFactor, SrcVT) = LTSrc;

851

Use a lambda to avoid code duplication?

Also should the last parameter be Src?

861

Split SrcVT?

864

Instead of further splitting SrcVT/DstVT, should we do the opposite? For example, given SrcVT = v16i16, and DstVT = v16i32, and on SSE2 we have SrcSplitFactor = 2, and DstSplitFactor = 4. Now suppose we have an cost entry for SrcVT = v8i16 and DstVT = v8i32, should we use this entry to estimate the cost? What you are doing here is making SrcVT = v4i16 and DstVT = v4i32, which gives less precise cost.

867

Is this necessary? When we reach here, SrcVT/DstVT should be the same as OrgSimpleSrcVT/OrgSimpleDstVT, so I think it is safe to let CheckOriginalTypes keep its true value.

nadav added inline comments.Dec 17 2015, 4:22 PM
../lib/Target/X86/X86TargetTransformInfo.cpp
812–831

Elena, I don't understand the logic below. It looks like you are re-implementing type legalization in the code model code. Why not just use getTypeLegalizationCost ?

Can you give an example where the existing logic fails?

delena added a subscriber: delena.Dec 17 2015, 11:36 PM

The getTypeLegalizationCost() shows only split-factor for vectors. It does not show promotion, sign-extend, zero-extend, widening cost.
Now, let's assume that we have vector-spilt. I take the max split-factor between source and destination. (And this is right, agree?)

In case of promotion/widening the getTypeLegalizationCost() does not help. It just show the new types after promotion.
For example v4i8 sext to v4i32, the type legalization result will be v4i32 sext to v4i32.
Or v2i8 to v2f32 - the type legalization will convert the operation to " v2i64 to v4f32 ".
It is very difficult to describe the right cost in this mode. If you don't have any automatic engine for the cost estimation, it is easier to put the right cost for v2i8 to v2f32.

The only one thing that I, probably, forgot is
V8i16 -> v8f64 on AVX2
The result will be
V4i32 to v4f64 and I want to see v4i16 to v4f64. This way I'll be able to add SEXT cost from v4i16 to v4i32.

Why did I start to rewrite?

Because one operation
%a = sext i32 %b to i64
Does not allow to vectorize the whole loop to 16, just because the cost of "%a = sext <16 x i32>%b to <16 x 64>" is 300 instead of 2.

  • Elena
nadav edited edge metadata.Dec 18 2015, 9:05 AM

Thanks for explaining this Elena. Have you considered handling all of the special cases by adding them to the 'TypeConversionCostTblEntry' table? Also, have you considered improving getTypeLegalizationCost?

I re-checked and fixed many entries in SSE and AVX tables. I do not plan to rewrite getTy[eLegalizationCost right now.

  • Elena
delena marked 3 inline comments as done.Dec 20 2015, 6:57 AM
delena added inline comments.
../lib/Target/X86/X86TargetTransformInfo.cpp
867

Ok.

delena updated this revision to Diff 43328.Dec 20 2015, 6:59 AM

I changed the code according to Cong's comments.

Elena, I don't understand your comment. getTypeLegalizationCost imitates the type legalizer by splitting and promoting. It actually uses the type legalizer. What do you mean it does not support zero-extend and sign-extend?

Are you saying that it does not handle dag-combine peepholes that we have? Please explain what problem your patch is solving.

I suspect that you can commit changes to the cost table not as part of this patch. But please do not modify the code that uses getTypeLegalizationCost without explaining what it does.

Hi Nadav,

The getTypeLegalizationCost() returns a pair:
Res.first is a Split-Factor
Res.second is a legal MVT

Let's assume that the original type v4i8. The Res.second will show MVT::4i32. Res.first = 1.
Now, 4i8 is promoted to 4i32 and you don't know the real cost of this promotion. And the getTypeLegalizationCost() can't provide this information.
Just because it does not know how to promote - sign-extend or zero-extend.
SITOFP requires sign-extend, UITOFP requires zero-extend.

The real information that we receive from getTypeLegalizationCost() is the split-factor. I use it in the following way:
Example 1:
SSE2:

Sitofp <4 x i32 > %a to <4 x f64>

The split-factor of destination is 2. How do I calculate the cost?

2 * costof (sitofp <2 x i32 > %a to <2 x f64>) + ExtraSplitCost

The ExtraSplitCost is the cost of splitting vector %a.

Example 2:
AVX2:

Sitofp <8 x i64 > %a to <8 x f64>

The split-factor of source and destination is 2. How do I calculate the cost?

2 * costof (sitofp <4 x i64 > %a to <4 x f64>)
The ExtraSplitCost=0 because the source and the destination both come in 2 registers, I don't need any additional shuffle.

Example 3:
SSE2:
Sitofp <4 x i8 > to <4 x f32>
For the source getTypeLegalizationCost() returns (1, MVT::4i32)
For the dest getTypeLegalizationCost() returns (1, MVT::4f32)

The original version returns
1 * costof(Sitofp <4 x i32 > to <4 x f32>)
I changed it to
1 * costof(Sitofp <4 x i8 > to <4 x f32>)

Now, after my changes the following 2 operations
Sitofp <4 x i8 > to <4 x f32>
And
Sitofp <4 x i32 > to <4 x f32>
Have the different cost.

The model, I'm proposing does more exact cost estimation. I use getTypeLegalizationCost() for split-factor only.
I can't add one or two lines that I'm missing, because the current model is wrong.
This is the original wrong code:

> EVT SrcTy = TLI->getValueType(DL, Src);
> EVT DstTy = TLI->getValueType(DL, Dst);
  
> // The function getSimpleVT only handles simple value types.
> if (!SrcTy.isSimple() || !DstTy.isSimple())
>   return BaseT::getCastInstrCost(Opcode, Dst, Src);
  • Elena
nadav added a comment.Dec 21 2015, 9:00 AM

Elena,

We don't let the instruction Sitofp <4 x i32 > %a to <4 x f64> go through the type legalizer.

We handle it as a DAGCombine optimization. DAGCombine optimizations are represented as entries in the table.

-Nadav

The current cost model gives cost 40.
The new model gives 3.

This is the real code (SSE2)
sitofpv4i32v4double: # @sitofpv4i32v4double

.cfi_startproc

BB#0:

cvtdq2pd        %xmm0, %xmm2
pshufd  $78, %xmm0, %xmm0       # xmm0 = xmm0[2,3,0,1]
cvtdq2pd        %xmm0, %xmm1
movaps  %xmm2, %xmm0
retq
  • Elena

Elena, please fix the cost table and don't change anything else.

delena abandoned this revision.Dec 21 2015, 10:46 AM

The proposed solution was not accepted.