This is an archive of the discontinued LLVM Phabricator instance.

[SLP]Allow masked gathers only if allowed by target.
ClosedPublic

Authored by ABataev on Apr 26 2021, 7:41 AM.

Details

Summary

Need to check if target allows/supports masked gathers before trying to
emit it, otherwise may generate less performant code for the targets
where masked gathers are slower than regular gathers.

Part of D57059.

Diff Detail

Event Timeline

ABataev created this revision.Apr 26 2021, 7:41 AM
ABataev requested review of this revision.Apr 26 2021, 7:41 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 26 2021, 7:41 AM
spatel added inline comments.Apr 27 2021, 5:41 AM
llvm/test/Transforms/SLPVectorizer/X86/pr47629.ll
264

Seems like an improvement, but do you know what happens on this example? There's no masked gather, but that interfered with the cost calc for forming the vector add or store?

ABataev added inline comments.Apr 27 2021, 5:45 AM
llvm/test/Transforms/SLPVectorizer/X86/pr47629.ll
264

Yes, the target does not support masked gather and we overestimate the cost of the masked memory operations. Because of that, we could not vectorize this sequence previously. With legality checks, simple gather makes it profitable for vectorization.

We seem to go back and forth on legality vs. pure cost model to decide what is allowed for vectorization.
This would undo part of D90445, so adding potential reviewers for more feedback.

We seem to go back and forth on legality vs. pure cost model to decide what is allowed for vectorization.
This would undo part of D90445, so adding potential reviewers for more feedback.

Pure cost won't work here, we need to check if this is legal just to try to generate masked gather at all. Cost problem is different.

We seem to go back and forth on legality vs. pure cost model to decide what is allowed for vectorization.
This would undo part of D90445, so adding potential reviewers for more feedback.

Pure cost won't work here, we need to check if this is legal just to try to generate masked gather at all. Cost problem is different.

Is there some gather construct that is impossible for the backend to expand correctly or just that it can't be expanded efficiently?

Pure cost won't work here, we need to check if this is legal just to try to generate masked gather at all. Cost problem is different.

Can't we rely on cost model to deal with this? This can be profitable for futher vectorization, when we can start from such gather loads as seed. Yes, we have "unlegal" gathers for llvm ir, but they are lowered to legal code eventually. What is the motivation for this change?

We seem to go back and forth on legality vs. pure cost model to decide what is allowed for vectorization.
This would undo part of D90445, so adding potential reviewers for more feedback.

Pure cost won't work here, we need to check if this is legal just to try to generate masked gather at all. Cost problem is different.

Is there some gather construct that is impossible for the backend to expand correctly or just that it can't be expanded efficiently?

There is known problem that after landing of the patch with masked gathers some of the code produces wrong results, so I suppose yes, there is a problem with lowering. Generally speaking, would be good to calculate the cost/lower masked gathers to simple gathers if it is not legal/not profitable to use masked gathers. And tweak the cost model for such a case. AS temp solution we can add simple legality checks, at least, without early checks for the cost.

Pure cost won't work here, we need to check if this is legal just to try to generate masked gather at all. Cost problem is different.

Can't we rely on cost model to deal with this? This can be profitable for futher vectorization, when we can start from such gather loads as seed. Yes, we have "unlegal" gathers for llvm ir, but they are lowered to legal code eventually. What is the motivation for this change?

The cost model must be tweaked to support this kind of lowering. I can rework the patch in terms of cost model improvement to handle the case with non-legal masked gather more correctly instead of direct legality checks in SLP. Will it work?

We seem to go back and forth on legality vs. pure cost model to decide what is allowed for vectorization.
This would undo part of D90445, so adding potential reviewers for more feedback.

Pure cost won't work here, we need to check if this is legal just to try to generate masked gather at all. Cost problem is different.

Is there some gather construct that is impossible for the backend to expand correctly or just that it can't be expanded efficiently?

There is known problem that after landing of the patch with masked gathers some of the code produces wrong results, so I suppose yes, there is a problem with lowering.

Is there a bugreport with reproducer?

Generally speaking, would be good to calculate the cost/lower masked gathers to simple gathers if it is not legal/not profitable to use masked gathers. And tweak the cost model for such a case. AS temp solution we can add simple legality checks, at least, without early checks for the cost.

We seem to go back and forth on legality vs. pure cost model to decide what is allowed for vectorization.
This would undo part of D90445, so adding potential reviewers for more feedback.

Pure cost won't work here, we need to check if this is legal just to try to generate masked gather at all. Cost problem is different.

Is there some gather construct that is impossible for the backend to expand correctly or just that it can't be expanded efficiently?

There is known problem that after landing of the patch with masked gathers some of the code produces wrong results, so I suppose yes, there is a problem with lowering.

Is there a bugreport with reproducer?

Not sure about it. @anton-afanasyev?

Generally speaking, would be good to calculate the cost/lower masked gathers to simple gathers if it is not legal/not profitable to use masked gathers. And tweak the cost model for such a case. AS temp solution we can add simple legality checks, at least, without early checks for the cost.

We seem to go back and forth on legality vs. pure cost model to decide what is allowed for vectorization.
This would undo part of D90445, so adding potential reviewers for more feedback.

Pure cost won't work here, we need to check if this is legal just to try to generate masked gather at all. Cost problem is different.

Is there some gather construct that is impossible for the backend to expand correctly or just that it can't be expanded efficiently?

There is known problem that after landing of the patch with masked gathers some of the code produces wrong results, so I suppose yes, there is a problem with lowering.

Is there a bugreport with reproducer?

I just saw 2 reports about it, the first one in the original patch (IIRC) and another one in my patch that just triggered some extra vectorizations and again revealed the issue.

Generally speaking, would be good to calculate the cost/lower masked gathers to simple gathers if it is not legal/not profitable to use masked gathers. And tweak the cost model for such a case. AS temp solution we can add simple legality checks, at least, without early checks for the cost.

Pure cost won't work here, we need to check if this is legal just to try to generate masked gather at all. Cost problem is different.

Can't we rely on cost model to deal with this? This can be profitable for futher vectorization, when we can start from such gather loads as seed. Yes, we have "unlegal" gathers for llvm ir, but they are lowered to legal code eventually. What is the motivation for this change?

The cost model must be tweaked to support this kind of lowering. I can rework the patch in terms of cost model improvement to handle the case with non-legal masked gather more correctly instead of direct legality checks in SLP. Will it work?

I believe it will work correct after lowering fixes. I'm still investigating this case https://bugs.llvm.org/show_bug.cgi?id=50015, this is definitly lowering issue since llvm code looks correct.
Ok, I think this patch is good temp solution until miscompile issue isn't solved.

Is there a bugreport with reproducer?

Here it is https://bugs.llvm.org/show_bug.cgi?id=50015

Pure cost won't work here, we need to check if this is legal just to try to generate masked gather at all. Cost problem is different.

Can't we rely on cost model to deal with this? This can be profitable for futher vectorization, when we can start from such gather loads as seed. Yes, we have "unlegal" gathers for llvm ir, but they are lowered to legal code eventually. What is the motivation for this change?

The cost model must be tweaked to support this kind of lowering. I can rework the patch in terms of cost model improvement to handle the case with non-legal masked gather more correctly instead of direct legality checks in SLP. Will it work?

I believe it will work correct after lowering fixes. I'm still investigating this case https://bugs.llvm.org/show_bug.cgi?id=50015, this is definitly lowering issue since llvm code looks correct.
Ok, I think this patch is good temp solution until miscompile issue isn't solved.

Maybe better to rework it and land it in X86TargetTransfromInfo.cpp for better cost modelling if it is not legal directly?

I believe it will work correct after lowering fixes. I'm still investigating this case https://bugs.llvm.org/show_bug.cgi?id=50015, this is definitly lowering issue since llvm code looks correct.
Ok, I think this patch is good temp solution until miscompile issue isn't solved.

Maybe better to rework it and land it in X86TargetTransfromInfo.cpp for better cost modelling if it is not legal directly?

Yes, I believe this will fix PR50015, though it doesn't fix the whole problem related to lowering.

I believe it will work correct after lowering fixes. I'm still investigating this case https://bugs.llvm.org/show_bug.cgi?id=50015, this is definitly lowering issue since llvm code looks correct.
Ok, I think this patch is good temp solution until miscompile issue isn't solved.

Maybe better to rework it and land it in X86TargetTransfromInfo.cpp for better cost modelling if it is not legal directly?

Yes, I believe this will fix PR50015, though it doesn't fix the whole problem related to lowering.

No, I don't think it will fix PR50015, it may hide it in some cases, yes. My main goal is to improve vectorization in general. Ok, I will rework it and land it in in a cost model for masked gathers.

Yes, I believe this will fix PR50015, though it doesn't fix the whole problem related to lowering.

No, I don't think it will fix PR50015, it may hide it in some cases, yes. My main goal is to improve vectorization in general. Ok, I will rework it and land it in in a cost model for masked gathers.

Yes, I mean "hide" here. The cost there is almost zero, so cost tuning will lift it under threshold.

ABataev updated this revision to Diff 340965.Apr 27 2021, 1:15 PM

Changed the way we estimate the cost of ScatterVectorize nodes, if the masked gather is not legal it will be lowered to a gather, so need to estimate a cost to gather too.

llvm/test/Transforms/SLPVectorizer/X86/pr47629-inseltpoison.ll
55

Hmm, so this is even less strict. Then pr50015 isn't fixed as well.

ABataev added inline comments.Apr 28 2021, 4:57 AM
llvm/test/Transforms/SLPVectorizer/X86/pr47629-inseltpoison.ll
55

Yes, because currently we incorrectly calculate the cost of masked gathers for SLP. If the masked gather is not legal/profitable, TTI returns its scalarized cost, which is gather cost + cost of scalar loads, while for SLP we should end up with just gather cost. This prevents some vectorizations if masked gather is not legal and/or profitable.
Plus, this patch never meant to fix pr50015, just to improve the vectorization.

anton-afanasyev accepted this revision.May 3 2021, 5:51 AM

Ok, looks good.

This revision is now accepted and ready to land.May 3 2021, 5:51 AM
This revision was automatically updated to reflect the committed changes.

I think it is a question to the lowering of masked gathers. Generally speaking, this patch should just restore the original code size/performance we had before masked gathers commit (better for old archs, which do not natively support masked gathers but have fast gathers).

uabelho added a subscriber: uabelho.May 4 2021, 1:13 AM

I'm seeing a strange case with this patch where I now get calls to

llvm.masked.gather.v2i16.v2p0i16

after SLP vectorizer and before we didn't?
Masked gather is not legal on my target. Something seems to be backwards?

I'm seeing a strange case with this patch where I now get calls to

llvm.masked.gather.v2i16.v2p0i16

after SLP vectorizer and before we didn't?
Masked gather is not legal on my target. Something seems to be backwards?

In my case the cost as computed by TTI->getGatherScatterOpCost (so the cost prior to this patch) is 6 but calculated in the new way (since masked gather is not legal) I get the cost 0. getGatherCost(VL) returns 2 and then getEntryCost(It->get()) 2 as well so VecLdCost now ends up as 0.

I'm seeing a strange case with this patch where I now get calls to

llvm.masked.gather.v2i16.v2p0i16

after SLP vectorizer and before we didn't?
Masked gather is not legal on my target. Something seems to be backwards?

In my case the cost as computed by TTI->getGatherScatterOpCost (so the cost prior to this patch) is 6 but calculated in the new way (since masked gather is not legal) I get the cost 0. getGatherCost(VL) returns 2 and then getEntryCost(It->get()) 2 as well so VecLdCost now ends up as 0.

If the masked gather is not legal, it should be converted to a regular gather by scalarizeMaskedGather function (called by ScalarizeMaskedMemIntrinLegacyPass::runOnFunction), that's why we calculate it as a regular gather

I'm seeing a strange case with this patch where I now get calls to

llvm.masked.gather.v2i16.v2p0i16

after SLP vectorizer and before we didn't?
Masked gather is not legal on my target. Something seems to be backwards?

In my case the cost as computed by TTI->getGatherScatterOpCost (so the cost prior to this patch) is 6 but calculated in the new way (since masked gather is not legal) I get the cost 0. getGatherCost(VL) returns 2 and then getEntryCost(It->get()) 2 as well so VecLdCost now ends up as 0.

If the masked gather is not legal, it should be converted to a regular gather by scalarizeMaskedGather function (called by ScalarizeMaskedMemIntrinLegacyPass::runOnFunction), that's why we calculate it as a regular gather

So you say it is expected that with this patch, for a target where masked gather is not legal, the SLP Vectorizer should now suddenly introduce @llvm.masked.gather where it didn't before?

uabelho added a comment.EditedMay 4 2021, 4:27 AM

E.g.

opt -slp-vectorizer -S -o - slp2.ll -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900

now results in

define void @foo(%rec* %in_p) #0 {
entry:
  %ptr1 = getelementptr inbounds %rec, %rec* %in_p, i16 0, i32 2
  %ptr2 = getelementptr inbounds %rec, %rec* %in_p, i16 0, i32 0
  %0 = insertelement <2 x i16*> poison, i16* %ptr2, i32 0
  %1 = insertelement <2 x i16*> %0, i16* %ptr1, i32 1
  %2 = call <2 x i16> @llvm.masked.gather.v2i16.v2p0i16(<2 x i16*> %1, i32 1, <2 x i1> <i1 true, i1 true>, <2 x i16> undef)
  store <2 x i16> %2, <2 x i16>* bitcast (%rec* @g to <2 x i16>*), align 1
  ret void
}

But before this patch it didn't use llvm.masked.gather.
And as far as I can see masked gather is not legal for this target.

I'm seeing a strange case with this patch where I now get calls to

llvm.masked.gather.v2i16.v2p0i16

after SLP vectorizer and before we didn't?
Masked gather is not legal on my target. Something seems to be backwards?

In my case the cost as computed by TTI->getGatherScatterOpCost (so the cost prior to this patch) is 6 but calculated in the new way (since masked gather is not legal) I get the cost 0. getGatherCost(VL) returns 2 and then getEntryCost(It->get()) 2 as well so VecLdCost now ends up as 0.

If the masked gather is not legal, it should be converted to a regular gather by scalarizeMaskedGather function (called by ScalarizeMaskedMemIntrinLegacyPass::runOnFunction), that's why we calculate it as a regular gather

So you say it is expected that with this patch, for a target where masked gather is not legal, the SLP Vectorizer should now suddenly introduce @llvm.masked.gather where it didn't before?

Well, generally speaking, yes. Just if it is not legal, it should be scalarized and turned into a regular gather later. And the patch calculates the cost of the gather, if it is not legal

I'm seeing a strange case with this patch where I now get calls to

llvm.masked.gather.v2i16.v2p0i16

after SLP vectorizer and before we didn't?
Masked gather is not legal on my target. Something seems to be backwards?

In my case the cost as computed by TTI->getGatherScatterOpCost (so the cost prior to this patch) is 6 but calculated in the new way (since masked gather is not legal) I get the cost 0. getGatherCost(VL) returns 2 and then getEntryCost(It->get()) 2 as well so VecLdCost now ends up as 0.

If the masked gather is not legal, it should be converted to a regular gather by scalarizeMaskedGather function (called by ScalarizeMaskedMemIntrinLegacyPass::runOnFunction), that's why we calculate it as a regular gather

So you say it is expected that with this patch, for a target where masked gather is not legal, the SLP Vectorizer should now suddenly introduce @llvm.masked.gather where it didn't before?

Well, generally speaking, yes. Just if it is not legal, it should be scalarized and turned into a regular gather later. And the patch calculates the cost of the gather, if it is not legal

Ok. A little surprising to me given the patch headline:

[SLP]Allow masked gathers only if allowed by target.

because the opposite is what is actually happening.

I saw a regression for my out-of-tree target with this patch, that's why it caught my eye.

E.g.

opt -slp-vectorizer -S -o - slp2.ll -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900

now results in

define void @foo(%rec* %in_p) #0 {
entry:
  %ptr1 = getelementptr inbounds %rec, %rec* %in_p, i16 0, i32 2
  %ptr2 = getelementptr inbounds %rec, %rec* %in_p, i16 0, i32 0
  %0 = insertelement <2 x i16*> poison, i16* %ptr2, i32 0
  %1 = insertelement <2 x i16*> %0, i16* %ptr1, i32 1
  %2 = call <2 x i16> @llvm.masked.gather.v2i16.v2p0i16(<2 x i16*> %1, i32 1, <2 x i1> <i1 true, i1 true>, <2 x i16> undef)
  store <2 x i16> %2, <2 x i16>* bitcast (%rec* @g to <2 x i16>*), align 1
  ret void
}

But before this patch it didn't use llvm.masked.gather.
And as far as I can see masked gather is not legal for this target.

Well, this should not happen because of the tiny vectorization tree. I'll revert the patch and will add some extra detection/rework it for better cost estimation.

E.g.

opt -slp-vectorizer -S -o - slp2.ll -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900

now results in

define void @foo(%rec* %in_p) #0 {
entry:
  %ptr1 = getelementptr inbounds %rec, %rec* %in_p, i16 0, i32 2
  %ptr2 = getelementptr inbounds %rec, %rec* %in_p, i16 0, i32 0
  %0 = insertelement <2 x i16*> poison, i16* %ptr2, i32 0
  %1 = insertelement <2 x i16*> %0, i16* %ptr1, i32 1
  %2 = call <2 x i16> @llvm.masked.gather.v2i16.v2p0i16(<2 x i16*> %1, i32 1, <2 x i1> <i1 true, i1 true>, <2 x i16> undef)
  store <2 x i16> %2, <2 x i16>* bitcast (%rec* @g to <2 x i16>*), align 1
  ret void
}

But before this patch it didn't use llvm.masked.gather.
And as far as I can see masked gather is not legal for this target.

Well, this should not happen because of the tiny vectorization tree. I'll revert the patch and will add some extra detection/rework it for better cost estimation.

Ok, thanks!