This is an archive of the discontinued LLVM Phabricator instance.

[X86] Improve costmodel for scalar byte swaps
ClosedPublic

Authored by lebedev.ri on May 5 2021, 10:24 AM.

Details

Summary

Currently we model i16 bswap as very high cost (10),
which doesn't seem right, with all other being at 1.

Regardless of MOVBE, i16 reg-reg bswap is lowered into
(an extending move plus) rot-by-8:
https://godbolt.org/z/8jrq7fMTj
I think it should at worst have throughput of 1:

Since i32/i64 already have cost of 1,
MOVBE doesn't improve their costs any further.

BUT, MOVBE must have at least a single memory operand, with other being a register.
Which means, if we have a bswap of load, iff load has a single use, we'll fold bswap into load.
Likewise, if we have store of a bswap, iff bswap has a single use, we'll fold bswap into store.
So i think we should treat such a bswap as free.

Diff Detail

Event Timeline

lebedev.ri created this revision.May 5 2021, 10:24 AM
lebedev.ri requested review of this revision.May 5 2021, 10:24 AM
lebedev.ri edited the summary of this revision. (Show Details)May 5 2021, 10:30 AM

Currently we don't model i16 bswap as very high cost (10),
which doesn't seem right, with all other being at 1.

Was that supposed to say "we model i16 as very high cost(10)" with the don't?

i8 reg-reg

i16 reg-reg?

It looks like i64 BSWAP on Intel Core is 2 uops. Possibly one uop to swap the bytes in the upper and lower halves separately followed by rotate by 32 to exchange the halves.

llvm/lib/Target/X86/X86TargetTransformInfo.cpp
2927

At least on Intel Core CPUs, MOVBE isn't optimized. It's a load or store and a bswap operation. Maybe it's optimized on Atom/Silvermont/Goldmont? It was added to that line of CPU first possibly because those CPUs have been used in networking equipment.

llvm/test/Analysis/CostModel/X86/bswap-store.ll
76

These check lines vanished and were not replaced

113

Same here

llvm/test/Analysis/CostModel/X86/load-bswap.ll
121

And here

llvm/test/Transforms/SLPVectorizer/X86/arith-add-usat.ll
29

And here

RKSimon added inline comments.May 5 2021, 10:57 AM
llvm/test/Analysis/CostModel/X86/load-bswap.ll
90

Script being stupid again - you'll need to fix your prefixes

lebedev.ri added inline comments.May 5 2021, 11:13 AM
llvm/test/Analysis/CostModel/X86/load-bswap.ll
90

UGH, i thought i did :(
We need to disable FileCheck strict mode for x86 tests.

lebedev.ri edited the summary of this revision. (Show Details)May 5 2021, 2:04 PM
lebedev.ri added inline comments.May 5 2021, 2:15 PM
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
2927

Looking at actual AMD Zen3 measurements, movbe r<-m is 1 uop, while movbe m<-r is 2,
which is actually a regression from Zen2/Zen1, as per https://www.agner.org/optimize/instruction_tables.pdf.

As per that table, both are really slow on haswell/broadwell/skylake*,
but fast on Silvermont/Goldmont*/KNL.

So i think we could mark movbe r<-m on AMD's at least.

2927

So i think we could mark movbe r<-m *as free* on AMD's at least.

lebedev.ri updated this revision to Diff 343216.May 5 2021, 3:42 PM
lebedev.ri marked an inline comment as done.

Addressing review note - introduce FeatureFastMOVBE, set it on selected CPU models.

lebedev.ri updated this revision to Diff 343220.May 5 2021, 3:53 PM

Dumb down check prefixes...

Random comment: This has me wishing that we'd made further progress with D46276 - I'd much prefer being able to peer into the scheduler models for costs than relying on yet more feature flags :(

llvm/lib/Target/X86/X86.td
519

You say 'prefer' - is a future intention to alter isel depending on this? I'm not sure that's actually useful.

llvm/test/Transforms/SLPVectorizer/X86/bitreverse.ll
30

AVX?

lebedev.ri marked an inline comment as done.May 6 2021, 6:36 AM
lebedev.ri added inline comments.
llvm/lib/Target/X86/X86.td
519

You say 'prefer' - is a future intention to alter isel depending on this?

Not really. I was only interested in doing what is contained in this patch,
i.e. modelling bswap as free in costmodel, if it will be free.

RKSimon added inline comments.May 6 2021, 7:01 AM
llvm/lib/Target/X86/X86.td
519

I guess technically different codegen might still occur if the costs come out slightly different and the vectorizer 'prefers' scalar code - sorry for the noise.

lebedev.ri marked 2 inline comments as done.May 6 2021, 7:03 AM
lebedev.ri added inline comments.
llvm/lib/Target/X86/X86.td
519

That wasn't noise. "Do you intend to prohibit formation of movbe if not HasFastMOVBE" seems like a good question.
(i do not.)

lebedev.ri marked an inline comment as done.May 6 2021, 2:02 PM
lebedev.ri added a subscriber: magabari.

Random comment: This has me wishing that we'd made further progress with D46276 - I'd much prefer being able to peer into the scheduler models for costs than relying on yet more feature flags :(

I know, right? E.g. even when looking at the most simplest snippet from

@RKSimon @magabari I'd like to add some more tuples, but i have a question: how are the costs actually derived?
For example, the assembly for interleaved load of i16 w/ stride 2: https://godbolt.org/z/hjb3d5x6E
What's it cost? I'm guessing it's not just 10, aka the instruction count excluding the loads/stores?
Is it 5 from Block RThroughput: 4.8 from MCA: https://godbolt.org/z/fxYcEj3Wx ?
Which CPU should be used for these numbers?

I believe they were taken from IACA probably with a Haswell CPU - a reciprocal throughput from llvm-mca should be similar.

Usually with cost tables we tend to compare numbers from similar spec CPUs (AVX2 - Haswell/Ryzen) and choose the worst.....

I see. So in this case we have:

therefore for that tuple we choose 9, correct? I'm not seeing any other sched models for AVX2 but not AVX512 CPU's.

And another question: now that we've established the rules, should i be submitting these changes through review,
or committing these directly? I fear former would either result in bulky patches that are hard to review,
or saturate the review queue.

... the difference between haswell and zen3 is 3x...
I don't really want to add yet another costmodel subset, but that difference is *NOT* great...

RKSimon added inline comments.May 7 2021, 9:20 AM
llvm/test/Transforms/SLPVectorizer/X86/arith-add-ssat.ll
89

why are these sat math tests changing?

lebedev.ri marked an inline comment as done.

Don't forget to check that we are actually looking at ISD::BSWAP

llvm/test/Transforms/SLPVectorizer/X86/arith-add-ssat.ll
89

Oops :)

RKSimon accepted this revision.May 8 2021, 3:00 AM

LGTM - cheers

This revision is now accepted and ready to land.May 8 2021, 3:00 AM

LGTM - cheers

Thank you for the review!

This revision was landed with ongoing or failed builds.May 8 2021, 5:18 AM
This revision was automatically updated to reflect the committed changes.