This is an archive of the discontinued LLVM Phabricator instance.

[TTI] The cost model should not assume illegal vector casts get completely scalarized
ClosedPublic

Authored by mkuper on Jun 10 2016, 4:50 PM.

Details

Summary

If a vector cast gets split, it's quite possible that the resulting casts are legal and cheap.
So, instead of pessimistically assuming scalarization, we use the costs the concrete TTI provides for the split vector.

This looks like it does the right thing for AVX (a lot of overblown costs drop dramatically) - but I'm less sure about ARM.

Diff Detail

Repository
rL LLVM

Event Timeline

mkuper updated this revision to Diff 60426.Jun 10 2016, 4:50 PM
mkuper retitled this revision from to [TTI] The cost model should not assume illegal vector casts get completely scalarized.
mkuper updated this object.
mkuper added a subscriber: llvm-commits.
jmolloy edited edge metadata.Jun 13 2016, 1:10 AM

The ARM test changes look vaguely sensible to me.

arsenm added inline comments.Jun 14 2016, 6:08 PM
test/Analysis/CostModel/AMDGPU/addrspacecast.ll
39 ↗(On Diff #60426)

Pretty much everything should be scalarized. The vector inserts and extracts are supposed to be free (and the cost is reported as 0 for those), so I think adding the 1 there is inconsistent; this should check the extract/insert cost instead.

mkuper added inline comments.Jun 14 2016, 6:34 PM
test/Analysis/CostModel/AMDGPU/addrspacecast.ll
39 ↗(On Diff #60426)

Until now, we assumed scalarization, but I think this is actually the rare case in practice. If the platform cares about vectors, I'd expect it to support most vector operations at least at some vector width, so it usually won't scalarize. And if we assume partial splitting instead of scalarization, using the insert/extract costs will be the wrong thing, regardless of how imprecise "1" is for splitting (and it's definitely imprecise, but it's what the generic getTypeLegalizationCost() uses).

We could trace the entire legalization chain, and see whether the end result is a vector or a scalar, and then use either 1 or getScalarizationOverhead() based on that, but I'm not a huge fan of that.
(Is full scalarization common on AMDGPU, or is this a corner case? If it's common, perhaps we should specialize this for the AMDGPU TTI.)

arsenm added inline comments.Jun 17 2016, 1:53 PM
test/Analysis/CostModel/AMDGPU/addrspacecast.ll
39 ↗(On Diff #60426)

There are no vector operations. Vectors are only for loading and storing; every operation is scalar.

39 ↗(On Diff #60426)

There's also no additional scalarization cost since access of a vector element is just access of a subregister

Could someone on the X86 side verify that this is sane-ish?
Some costs dropped by a very large factor, and the new costs look better, but I'd appreciate another set of eyes.

test/Analysis/CostModel/AMDGPU/addrspacecast.ll
39 ↗(On Diff #60426)

Ah, I see.

I could add a TTI hook for "split cost", and set that to 0 for AMDGPU. But I'm not entirely happy with that either. How does that sound to you?
Any other suggestions that will make this work correctly for both AMDGPU and platforms that legalize vectors partially?

mkuper updated this revision to Diff 61313.Jun 20 2016, 3:29 PM
mkuper edited edge metadata.

Updated to not affect AMDGPU, by specializing the "split cost".

Unfortunately, we can't just plug in getVectorSplitCost() (either this or a more sophisticated version) into getTypeLegalizationCost().

This is because most of the users of getTypeLegalizationCost() apparently don't use the returned value as a "cost". Rather, they seem to assume it's just the number of post-legalization pieces, and multiply per-piece costs by that. So that needs a completely separate clean-up first.
AMDGPU itself is also guilty of this - it multiplies by getTypeLegalizationCost().first, so if getTypeLegalizationCost(LegalType) were to return 0 - which it ought to, if we really consider it the "cost" of legalization - things would get seriously out of whack. But this happens across all targets.

RKSimon edited edge metadata.Jun 21 2016, 8:17 AM

The SSE2 costs look very high which I guess is due to missing cost entries - that should be fixable in a pre/post patch if you prefer to keep them separate. The changed x86 costs all look reasonable.

test/Analysis/CostModel/X86/sitofp.ll
72 ↗(On Diff #61313)

This looks very high, codegen looks like it takes 26 ops - I haven't checked the throughput.

Thanks, Simon!
And you're right about the SSE2 costs, I'll look into that separately.

arsenm edited edge metadata.Jun 22 2016, 6:35 PM

> Updated to not affect AMDGPU, by specializing the "split cost".
>
> Unfortunately, we can't just plug in getVectorSplitCost() (either this or a more sophisticated version) into getTypeLegalizationCost().
>
> This is because most of the users of getTypeLegalizationCost() apparently don't use the returned value as a "cost". Rather, they seem to assume it's just the number of post-legalization pieces, and multiply per-piece costs by that. So that needs a completely separate clean-up first.
> AMDGPU itself is also guilty of this - it multiplies by getTypeLegalizationCost().first, so if getTypeLegalizationCost(LegalType) were to return 0 - which it ought to, if we really consider it the "cost" of legalization - things would get seriously out of whack. But this happens across all targets.

I was thinking that was to get the cost of the split. Everything should be NPieces * NElts * scalar cost, so e.g. <8 x float> -> <4 x float>, <4 x float> = 2 * 4 * cost.

arsenm added inline comments.Jun 22 2016, 6:39 PM
test/Analysis/CostModel/ARM/cast.ll
267 ↗(On Diff #61313)

LGTM, but it looks to me like this should be adding 0, so not increasing by 1?

>> Unfortunately, we can't just plug in getVectorSplitCost() (either this or a more sophisticated version) into getTypeLegalizationCost().
>>
>> This is because most of the users of getTypeLegalizationCost() apparently don't use the returned value as a "cost". Rather, they seem to assume it's just the number of post-legalization pieces, and multiply per-piece costs by that. So that needs a completely separate clean-up first.
>> AMDGPU itself is also guilty of this - it multiplies by getTypeLegalizationCost().first, so if getTypeLegalizationCost(LegalType) were to return 0 - which it ought to, if we really consider it the "cost" of legalization - things would get seriously out of whack. But this happens across all targets.
>
> I was thinking that was to get the cost of the split. Everything should be NPieces * NElts * scalar cost, so e.g. <8 x float> -> <4 x float>, <4 x float> = 2 * 4 * cost.

What exactly do you mean when you say "cost of the split"?
My expectation was for getTypeLegalizationCost() to return an approximation of the total cost of legalization (the cost of the INSERT_SUBVECTOR and EXTRACT_SUBVECTOR operations), that is, an additive factor. So, it would make sense to have "legalization cost + NElts * scalar cost", or, in the more general case, when we're not fully scalarizing, "legalization cost + NPieces * piece cost". Instead, what it's actually used for is to get NPieces. This is even explicitly referred to as the "split factor" in one of the X86 TTI uses:

std::pair<int, MVT> IdxsLT = TLI->getTypeLegalizationCost(DL, IndexVTy);
std::pair<int, MVT> SrcLT = TLI->getTypeLegalizationCost(DL, SrcVTy);
int SplitFactor = std::max(IdxsLT.first, SrcLT.first);

I'll have to go over all the uses of getTypeLegalizationCost(). If everything uses it as a proxy for the split factor, it's probably just a matter of changing the name to indicate what it really does. If not, then we may need two different functions, one to get the split factor, and one to get the *cost* of performing the splits.
But I think that's independent of this patch.

test/Analysis/CostModel/ARM/cast.ll
267 ↗(On Diff #61313)

Generally, with the new formula, it makes sense to have costs like 2 * 32 + 1, so I didn't pay too much attention to those little changes (what concerned me more were the big drops e.g. 64 -> 11 on line 326). But you're right, I need to verify that this is really reasonable and not some unexpected artifact.
Thanks!

delena edited edge metadata.Jun 23 2016, 2:37 PM

Could you, please, add a test for "sext <16 x i32> to <16 x i64>" for Skylake-avx512 and see what happens?
The cost should be 2. I know that this case did not work correctly and in many cases prevented vectorization to 16.

> Could you, please, add a test for "sext <16 x i32> to <16 x i64>" for Skylake-avx512 and see what happens?
> The cost should be 2. I know that this case did not work correctly and in many cases prevented vectorization to 16.

The cost is now 3 (instead of the old 48). It's not 2 because of the getVectorSplitCost() fudge factor.
We'll probably need to tune this in the future; as I said before, this is just a first approximation. It's possible that 0 is the right value most of the time.

If you don't mind, I'll add the test as a separate patch - we don't only want this specific test, we need tests for a bunch of sexts/zexts, like ARM has.

test/Analysis/CostModel/ARM/cast.ll
267 ↗(On Diff #61313)

So it mostly makes sense.
This used to be evaluated as fully scalarizing, with a per-element scalarization cost of 6, and cast cost of 10. So, (6 + 10) * 4 = 64.
Now we evaluate it as 1 + 2 * (cost(fptoui <2 x float> to <2 x i64>)). But the 2-wide cast is still considered fully scalarizing (even though the types are now legal), so we get 1 + 2 * (2 * (10 + 6)) = 65.

> If you don't mind, I'll add the test as a separate patch - we don't only want this specific test, we need tests for a bunch of sexts/zexts, like ARM has.

[Demikhovsky, Elena] yes, you can add tests later. You can look at this revision as a test reference for X86

http://reviews.llvm.org/D15604




So, just to make sure - at this point, everybody involved is "vaguely OK" with this, and there are no objections, right?

This revision was automatically updated to reflect the committed changes.