This is an archive of the discontinued LLVM Phabricator instance.

[X86][CostModel] X86TTIImpl::getMemoryOpCost(): rewrite vector handling again
ClosedPublic

Authored by lebedev.ri on Apr 16 2021, 1:40 PM.

Details

Summary

Instead of handling power-of-two sized vector chunks,
try handling the large vector in a stream mode,
decreasing the operational vector size
once it no longer works for the elements left to process.

Notably, this improves costs for overaligned loads - loading padding is fine.
This more directly tracks when we need to insert/extract the YMM/XMM subvector,
some costs fluctuate because of that.

Diff Detail

Event Timeline

lebedev.ri created this revision.Apr 16 2021, 1:40 PM
lebedev.ri requested review of this revision.Apr 16 2021, 1:40 PM
lebedev.ri retitled this revision from [WIP][X86][CostModel] Further `X86TTIImpl::getMemoryOpCost()` improvements to [X86][CostModel] X86TTIImpl::getMemoryOpCost(): rewrite vector handling again.
lebedev.ri edited the summary of this revision. (Show Details)
lebedev.ri added reviewers: RKSimon, ABataev.

Ready for review.
This ended up being a complete rewrite, and it's kinda ugly.

Rebased over reworked test coverage, NFC.

This looks like it needs further cleanup tbh - I've made a few comments but there's more.

llvm/lib/Target/X86/X86TargetTransformInfo.cpp
3245

Repeated VTy->getElementType() ?

3247

Merge these

assert((EltTyBits > 0) && ((EltTyBits % 8) == 0) && "Expected byte-size types");
3249

Can this assert happen?

3258

This feels like it should be unnecessary.....

3262

assert message?

3316

maybe pull out "bool IsLoad = Opcode == Instruction::Load;" style bools?

lebedev.ri marked 4 inline comments as done.

@RKSimon thank you for taking a look!
Trying to address review notes.

llvm/lib/Target/X86/X86TargetTransformInfo.cpp
3249

No, the previous assert would have fired already.
This is more to document the expected post-condition.
It is inferrable from the previous assert, but i thought this is more direct.

3258

I don't want to have int NumEltDone, because then we will have to update two things (it and NumEltRemaining).
Should i inline it? I though giving it a name would make code more readable.

3262

Same as previous assert, can't really happen, but i thought making it explicit might be good..

lebedev.ri marked an inline comment as done.Apr 22 2021, 5:08 AM
ABataev added inline comments.Apr 28 2021, 5:02 AM
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
3238

legalization

3260

Can be [=]

3293

Assert message?

lebedev.ri marked 3 inline comments as done.

@ABataev thank you for taking a look!
Rebasing, addressing review notes.

As the newly-changed tests show, getInterleavedMemoryOpCost*() aren't actually using this function,
which means that before i can look into adding more tuples there,
it would be best to finish with this.

llvm/lib/Target/X86/X86TargetTransformInfo.cpp
3260

We intentionally capture by-reference, because NumEltRemaining will change.

ABataev accepted this revision.May 11 2021, 4:48 AM

Looks good but maybe there are some other comments/concerns.

This revision is now accepted and ready to land.May 11 2021, 4:48 AM

Looks good but maybe there are some other comments/concerns.

Thank you for the review!

@RKSimon ?

RKSimon accepted this revision.May 11 2021, 5:31 AM

Mechanically I think its fine, I was trying to think of ways to clean it up as I don't think some of the code is as tidy as it could be, but we can iterate on that later on.

Thank you for the review!

Mechanically I think its fine, I was trying to think of ways to clean it up as I don't think some of the code is as tidy as it could be, but we can iterate on that later on.

Yeah, the code isn't quite up to the quality i prefer, but i don't see any obvious cleanup opportunities right now.
I think, this might have issues with weird types (not powers-of-two / not byte-sized), but i'm not sure how much we support those..

Rebased, NFC.

This revision was landed with ongoing or failed builds.May 11 2021, 6:03 AM
This revision was automatically updated to reflect the committed changes.
lebedev.ri added a comment.EditedMay 12 2021, 5:13 AM

@RKSimon i'm in need of a bit of guidance.

I'd like to maybe deal with getInterleavedMemoryOpCostAVX2() next, but i'm not sure what's the best way forward.
After thinking about it, it'm iffy about just adding more hardcoded entries to the costtable there.
We have element {i8, i16, i32, i64} * stride {2..6} * VF {8..64}. That's 80 entries already, by naive estimates.
This ignores partial strided loads (with Indices.size() != stride), and other vector widths.
Those will cause a basically exponential explosion.

Do we really want to proceed on that path?
I'm seeing two alternatives:

  1. Perhaps we should try to come up with an algorithmic approach, like we have here?
  2. Perhaps we should simply automate this? Run the strided load pattern through codegen, run that through exegesis, and automatically record it's performance?

IIRC the need for those interleaved costs was because we couldn't determine accurate 'lane crossing' vs 'non lane crossing' general shuffle costs - now that we have access to the shuffle mask it might be possible to recognise these a little better.

Thank you for replying!

IIRC the need for those interleaved costs was because we couldn't determine accurate 'lane crossing' vs 'non lane crossing' general shuffle costs - now that we have access to the shuffle mask it might be possible to recognise these a little better.

So the comment is that i should first try to costmodel generic shuffles? (X86TTIImpl::getShuffleCost())

Thank you for replying!

IIRC the need for those interleaved costs was because we couldn't determine accurate 'lane crossing' vs 'non lane crossing' general shuffle costs - now that we have access to the shuffle mask it might be possible to recognise these a little better.

So the comment is that i should first try to costmodel generic shuffles? (X86TTIImpl::getShuffleCost())

I have initial patch for the shuffles D100486

What is the actual problem you're trying to solve? Maybe refactoring the generic BasicTTIImplBase::getInterleavedMemoryOpCost implementation to use shuffles instead of extract+insert pairs might work?

What is the actual problem you're trying to solve? Maybe refactoring the generic BasicTTIImplBase::getInterleavedMemoryOpCost implementation to use shuffles instead of extract+insert pairs might work?

I'm trying to solve the problem of total lack of costs for e.g. i16, hopefully without manually deducing each entry for the table.

Thank you for replying!

IIRC the need for those interleaved costs was because we couldn't determine accurate 'lane crossing' vs 'non lane crossing' general shuffle costs - now that we have access to the shuffle mask it might be possible to recognise these a little better.

So the comment is that i should first try to costmodel generic shuffles? (X86TTIImpl::getShuffleCost())

I have initial patch for the shuffles D100486

That is good to hear!
However, i'm not actually sure if we can just use X86TTIImpl::getShuffleCost() here directly.
Won't the shuffle expansion result in overlapping shuffles, that we'll bill more than once?
I.e., don't we need to have our own implementation?

Thank you for replying!

IIRC the need for those interleaved costs was because we couldn't determine accurate 'lane crossing' vs 'non lane crossing' general shuffle costs - now that we have access to the shuffle mask it might be possible to recognise these a little better.

So the comment is that i should first try to costmodel generic shuffles? (X86TTIImpl::getShuffleCost())

I have initial patch for the shuffles D100486

That is good to hear!
However, i'm not actually sure if we can just use X86TTIImpl::getShuffleCost() here directly.
Won't the shuffle expansion result in overlapping shuffles, that we'll bill more than once?
I.e., don't we need to have our own implementation?

If we can generate in a more effective way, than a sequence of shuffles, then probably yes.

Thank you for replying!

IIRC the need for those interleaved costs was because we couldn't determine accurate 'lane crossing' vs 'non lane crossing' general shuffle costs - now that we have access to the shuffle mask it might be possible to recognise these a little better.

So the comment is that i should first try to costmodel generic shuffles? (X86TTIImpl::getShuffleCost())

I have initial patch for the shuffles D100486

That is good to hear!
However, i'm not actually sure if we can just use X86TTIImpl::getShuffleCost() here directly.
Won't the shuffle expansion result in overlapping shuffles, that we'll bill more than once?
I.e., don't we need to have our own implementation?

If we can generate in a more effective way, than a sequence of shuffles, then probably yes.

I'm not sure that is an answer for the the question i asked.
Let me reformulate the question:

Interleaved load, in IR, will have as many shuffles as the interleaving step.
Let's call then IL0, IL1, ...
But each one of those it won't be lowered into a single shuffle instruction,
but into a number of other instructions, let's call them ILI00, ILI01, ILI10, ILI10.

The question is: won't there be common instructions in lowered forms of each of the IL0, IL1,
i.e. could it be that ILI00 == ILI10?

If that happens, and we calculated the costs of IL0 and IL1 separately,
both of these costs will include ILI00, while it should only be included once.

fhahn added a subscriber: fhahn.May 13 2021, 2:12 AM
fhahn added inline comments.
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
3231–3232

Should scalable vectors ever reach here? I would expect something went very wrong in that case, so maybe we can use cast instead?

3244

This assert triggers if a vector of i1 (e.g. <16 x i1>) is passed. EltTyBits will be 1.

The assert can be triggered by running opt -cost-model -analyze -mtriple=x86_64-unknown-linux-gnu on

define void @foo(<16 x i1> %v, <16 x i1>* %ptr) {
  store <16 x i1> %v, <16 x i1>* %ptr
  ret void
}

See https://llvm.godbolt.org/z/jxPvdGEW4 for a run-able version.

etyurin added inline comments.
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
3244

Perhaps, DL.getTypeStoreSizeInBits should be used instead of getTypeSizeInBits.

fhahn added inline comments.May 13 2021, 1:09 PM
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
3244

I'm not sure. If it takes longer to resolve it might be good to revert it in the meantime.

lebedev.ri reopened this revision.May 13 2021, 2:03 PM
This revision is now accepted and ready to land.May 13 2021, 2:03 PM
lebedev.ri planned changes to this revision.May 13 2021, 2:04 PM

Adjusted handling to deal graciously with weird types.

This revision is now accepted and ready to land.May 20 2021, 6:55 AM
lebedev.ri marked 3 inline comments as done.May 20 2021, 6:59 AM

@fhahn added pretty exhaustive test coverage, and adjusted algo to mostly deal with everything. it doesn't crash at least
@RKSimon @ABataev ok to reland?

Observation: i only use the legalized vector size, but not the legalized vector element type, or the number of legalized vectors used.
I wonder if that is wrong.

I'm going to reland this tomorrow, and look into getInterleavedMemoryOpCost() next.