This is an archive of the discontinued LLVM Phabricator instance.

[CostModel] Add basic implementation of getGatherScatterOpCost.
ClosedPublic

Authored by fhahn on Nov 23 2020, 11:01 AM.

Details

Summary

Add a basic implementation of getGatherScatterOpCost to BasicTTIImpl.

The implementation estimates the cost of scalarizing the loads/stores,
the cost of packing/extracting the individual lanes and the cost of
only selecting enabled lanes.

This more accurately reflects the current cost on targets like AArch64.

Diff Detail

Event Timeline

fhahn created this revision.Nov 23 2020, 11:01 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 23 2020, 11:01 AM
fhahn requested review of this revision.Nov 23 2020, 11:01 AM

Hmm. I still don't think it's a great idea to try to SLP vectorize using gather operations if the target does not have any gathers! Having a better base cost does sound good though.

I think with a non-constant mask, the cost should be higher? I don't think you can just do it with a select, it will include i1 extracts and branches over whether the lanes are on or off.

fhahn updated this revision to Diff 307294.Nov 24 2020, 2:58 AM

Updated cost computation for variable masks, also include the cost of extracting the addresses from the address vector.

Hmm. I still don't think it's a great idea to try to SLP vectorize using gather operations if the target does not have any gathers! Having a better base cost does sound good though.

I think how we treat it in SLP exactly is a separate issue. I think assigning a more accurate cost will go a long way towards avoiding generating gathers in practice, unless they are beneficial even with the bad lowering.

I think with a non-constant mask, the cost should be higher? I don't think you can just do it with a select, it will include i1 extracts and branches over whether the lanes are on or off.

Yes it probably should better account for the fact that the mem ops need to be executed conditionally. Not really sure how to best cost that. I updated the cost to include extracting the individual conditions, doing a branch and a PHI to combine the result. With that, the cost is not really that much higher. Perhaps something else is missing, or it might be better to just return a sufficiently high number in that case?

I think assigning a more accurate cost will go a long way towards avoiding generating gathers in practice, unless they are beneficial even with the bad lowering.

In the past, I believe the only thing to create gather's would be the loop vectorizer, and it would only do so if isLegalMaskedGather returned true, which guards us against a lot of bad costs. More accurate cost modelling now that it is being used more generally is certainly a good thing.

Yes it probably should better account for the fact that the mem ops need to be executed conditionally. Not really sure how to best cost that. I updated the cost to include extracting the individual conditions, doing a branch and a PHI to combine the result. With that, the cost is not really that much higher. Perhaps something else is missing, or it might be better to just return a sufficiently high number in that case?

The costs of branches and phi's is often 0 unfortunately (due to branch predictors and phis being folded away and whatnot) - they make a good default for the normal type of branch/phi in a loop or if. With so many branches together here that might not be true and the real cost perhaps should be higher. Hmm. Not sure what do suggest. Do we think these scores are high enough? (at least for the moment, we can always adjust them later). As far as I understand the SLP vectorizer will use constant masks which makes it the more immediate concern.

I think assigning a more accurate cost will go a long way towards avoiding generating gathers in practice, unless they are beneficial even with the bad lowering.

In the past, I believe the only thing to create gather's would be the loop vectorizer, and it would only do so if isLegalMaskedGather returned true, which guards us against a lot of bad costs. More accurate cost modelling now that it is being used more generally is certainly a good thing.

Once the cost-model is more reasonable, we might want to also just let loop vectorize rely on the cost. Using the costs more widely should flush out any potential issues much quicker.

Yes it probably should better account for the fact that the mem ops need to be executed conditionally. Not really sure how to best cost that. I updated the cost to include extracting the individual conditions, doing a branch and a PHI to combine the result. With that, the cost is not really that much higher. Perhaps something else is missing, or it might be better to just return a sufficiently high number in that case?

The costs of branches and phi's is often 0 unfortunately (due to branch predictors and phis being folded away and whatnot) - they make a good default for the normal type of branch/phi in a loop or if. With so many branches together here that might not be true and the real cost perhaps should be higher. Hmm. Not sure what do suggest. Do we think these scores are high enough? (at least for the moment, we can always adjust them later). As far as I understand the SLP vectorizer will use constant masks which makes it the more immediate concern.

Yeah, the estimate is very rough. I think they costs should be sufficiently high so that it won't get picked in practice, but if there's a better way to model the fact that we need to introduce multiple blocks & branches, I'm happy to adjust the code. Alternatively we could also use a very high number for the VariableMask case. In any case, this should be a substantial improvement to returning 1 as cost, as we do now.

dmgreen accepted this revision.Nov 25 2020, 6:32 AM

Yep. We can improve the costs in the future as we need, and this seems like an excellent start to me.

LGTM.

This revision is now accepted and ready to land.Nov 25 2020, 6:32 AM

Thanks! I'm planning on landing this in tomorrow unless there are any other comments. I'll also add a comment that the estimate for the conditional execution is a bit rough.