Page MenuHomePhabricator

[X86][CostModel] Try to fix cost computation load/stores of non-power-of-two vectors
ClosedPublic

Authored by lebedev.ri on Thu, Apr 8, 5:10 AM.

Details

Summary

Sometimes LV has to produce really wide vectors,
and sometimes they end up being not powers of two.
As it can be seen from the diff, the cost computation
is currently completely non-sensical in those cases.

I don't really know what i'm doing, but does this look better?

Instead of just scalarizing everything, split/factorize the wide vector
into a number of subvectors, each one having a power-of-two elements,
recurse to get the cost of op on this subvector. Also, check how we'd
legalize this subvector, and if the legalized type is scalar,
also account for the scalarization cost.

Diff Detail

Event Timeline

lebedev.ri created this revision.Thu, Apr 8, 5:10 AM
lebedev.ri requested review of this revision.Thu, Apr 8, 5:10 AM
ABataev added inline comments.Thu, Apr 8, 5:33 AM
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
3224–3232

Why not just something like this:

unsigned Factor = 0;
for (; NumElem > 0; NumElem -= Factor) {
  Factor = PowerOf2Floor(NumElem);
  .....
}

(yes, this doesn't help LV itself, still looking into that)

lebedev.ri marked an inline comment as done.

Simplify loop as proposed by @ABataev.
I guess there is a similar problem in X86TTIImpl::getMaskedMemoryOpCost().

I've figured out why this doesn't affect LV - because i should be fixing X86TTIImpl::getInterleavedMemoryOpCost().
Will look there too..

RKSimon added inline comments.Thu, Apr 15, 5:32 AM
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
3238

APInt::getBitsSet ?

lebedev.ri marked an inline comment as done.

@RKSimon thank you for taking a look!

Rebased, addressed nit.

Cheers @lebedev.ri, the premise seems fine, and the costs are a lot more sensible

llvm/lib/Target/X86/X86TargetTransformInfo.cpp
3224–3232

+1 Having Factor updated in the condition as well as being used in increment block is difficult to grok

lebedev.ri added inline comments.Thu, Apr 15, 6:24 AM
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
3224–3232

Note that i have addressed @ABataev's comment, it was about earlier patch version:
https://reviews.llvm.org/D100099?id=336063#change-OQEVJvBxQbDZ

This is probably still impresice for small remainder sub-vectors.
E.g. load cost for <3 x float> w/ 8 byte alignment should be 1: https://godbolt.org/z/r3ncvMvaf

llvm/lib/Target/X86/X86TargetTransformInfo.cpp
3224–3232

... or are you telling to move Factor computation from the condition?

3236

Hm, i wonder if we also need to add getShuffleCost(SK_ExtractSubvector cost.
(with wide vector ty widened to next power of two)

This is probably still impresice for small remainder sub-vectors.
E.g. load cost for <3 x float> w/ 8 byte alignment should be 1: https://godbolt.org/z/r3ncvMvaf

That assumes 16-byte alignment though - for v3f32 the more likely (element) 4-byte alignment will be worse - not sure whether we have much difference alignment coverage (or whether its worth it).

The costs are a lot better than what they were, but there are a few cases that seem off - if I had to guess I'd say we're not completely getting the split types matching the legalized types?

llvm/lib/Target/X86/X86TargetTransformInfo.cpp
3224–3232

If you can, but it looks like making the dependencies between NumElemLeft and Factor simpler won't be easy - I'm happy for you to keep it as.

llvm/test/Analysis/CostModel/X86/load_store.ll
26–39

cost = 2 ? SSE max size is 128-bit vector - so no subvector extraction cost - <3 x double> seems to get it right?

lebedev.ri marked an inline comment as done.Thu, Apr 15, 10:41 AM

This is probably still impresice for small remainder sub-vectors.
E.g. load cost for <3 x float> w/ 8 byte alignment should be 1: https://godbolt.org/z/r3ncvMvaf

That assumes 16-byte alignment though - for v3f32 the more likely (element) 4-byte alignment will be worse - not sure whether we have much difference alignment coverage (or whether its worth it).

Sorry, i meant 16, yes.

The costs are a lot better than what they were, but there are a few cases that seem off - if I had to guess I'd say we're not completely getting the split types matching the legalized types?

llvm/test/Analysis/CostModel/X86/load_store.ll
26–39

This happens because for vectors we assume that extracting 0'th element of a FP vector is free:
https://github.com/llvm/llvm-project/blob/4f42d873c20291077f5a1ed37b102330d505f00d/llvm/lib/Target/X86/X86TargetTransformInfo.cpp#L3055-L3065

I've added some 64-byte aligned tests which we can probably add special case load costs - please can you rebase?

For the remaining edge cases - would you prefer to fix them here or iterate in follow up patches? I'd be happy to accept this as is (after rebase) - its a massive improvement and makes it much easier to tweak as we go on.

llvm/test/Analysis/CostModel/X86/load_store.ll
26–39

Ah - of course - we're hitting the issue that the extract + store costs are treated separately, but for f32/f64 extract_0 is free.

lebedev.ri marked 3 inline comments as done.

@RKSimon thank you for talking a look!
Rebased.
Since the next changes would reduce the cost, not increase it, i think they should go into next patch.

RKSimon accepted this revision.Fri, Apr 16, 5:22 AM

LGTM

This revision is now accepted and ready to land.Fri, Apr 16, 5:22 AM

LGTM

Thank you for the review.
I'll look into improving subvector load costs.

lebedev.ri closed this revision.Sun, Apr 18, 2:15 AM

This was committed in b06c55a6986e0e1d571663eec507664013b22f00 with proper Differential Revision: , i'm not sure why phab didn't pick it up.

srj added a subscriber: srj.Tue, Apr 20, 1:31 PM
srj added a comment.Tue, Apr 20, 1:47 PM

This commit appears to have injected a hang (or > 60s delay) in certain Halide tests when using the JIT (the "hang" being inside LLVM code, but JIT-generated code). I'm working on finding a small repro case.

This commit appears to have injected a hang (or > 60s delay) in certain Halide tests when using the JIT (the "hang" being inside LLVM code, but JIT-generated code). I'm working on finding a small repro case.

Hm, interesting. Could you please also check if that still reproduces with D100684 (awaiting review) ?

srj added a comment.Tue, Apr 20, 2:11 PM

Hm, interesting. Could you please also check if that still reproduces with D100684 (awaiting review) ?

Testing now

srj added a comment.Tue, Apr 20, 2:12 PM

Hm, interesting. Could you please also check if that still reproduces with D100684 (awaiting review) ?

Testing now

Sadly, no, still hangs indefinitely with that patched in

Hm, interesting. Could you please also check if that still reproduces with D100684 (awaiting review) ?

Testing now

Sadly, no, still hangs indefinitely with that patched in

Thanks for checking. Awaiting the reproducer.

srj added a comment.Tue, Apr 20, 2:21 PM

OK, here's a file that I think will repro it:

To see the hang, do ~/llvm-13-install/bin/llc -mcpu=penryn -o - -O3 bad.ll (Note that the cpu / microarchtecture matters here; the hang for us was only when specializing for SSE4.1 SIMD)

OK, here's a file that I think will repro it:

To see the hang, do ~/llvm-13-install/bin/llc -mcpu=penryn -o - -O3 bad.ll (Note that the cpu / microarchtecture matters here; the hang for us was only when specializing for SSE4.1 SIMD)

Hm, that's because this isn't the patch you're looking for <jedi hand-wave>.
Please rebisect, reverting this doesn't make that hang go away for me.

srj added a comment.Tue, Apr 20, 3:09 PM

Please rebisect, reverting this doesn't make that hang go away for me.

Huh, weird. I'll recheck.

srj added a comment.Tue, Apr 20, 3:40 PM

Please rebisect, reverting this doesn't make that hang go away for me.

When reverting, how did you resolve the (many) conflicts in load_store.ll? I'm not familiar with that code and am unsure of the correct resolution.

srj added a comment.Tue, Apr 20, 4:02 PM

Update:

  • rerunning bisect still points at this commit.
  • it looks like the reproducer I uploaded does indeed hang llc quite a ways back -- even llc from an LLVM12 build hangs in the same way.
  • running llc at -O0 succeeds, but -O1 or higher still fails.
  • Looks like we're triggering exponential runtime somewhere under TargetLowering::SimplifyDemandedBits.

Given this behavior, perhaps this is a pre-existing bug in the optimizer, which was never triggered before until this change "unmasked" it?

I'm trying to figure out if there's a way to get you a repro case without requiring you to build Halide locally (since the hang occurs when Halide uses MCJIT).

If you don't mind pulling and building Halide locally, it is easy to repro; here are steps in case you want to try (assumes a linux env):

  • git clone https://github.com/halide/Halide
  • cd Halide
  • git checkout srj/hang-repro # This is a branch I made to simplify the repro
  • export LLVM_CONFIG=/path/to/llvm/install/dir
  • export HL_JIT_TARGET=x86-64-linux-sse41
  • make -j$(nproc) correctness_vector_reductions

Note that this target runs with a 60-second timeout (which is generous, as it should normally be well under a second).

srj added a comment.Tue, Apr 20, 4:20 PM

FYI, I've opened a bug (https://bugs.llvm.org/show_bug.cgi?id=50049) for the llc hang, as it seems like a clear problem regardless of whether this commit is involved.

If it hangs all the way back, then the bisection *can not* point to this revision, and this could not have triggered/exposed this.

Since we can reproduce the hang with only llc, we know that this patch is not the real source of the bug. I'm reducing a test based on PR50049.

srj added a comment.Wed, Apr 21, 10:27 AM

If it hangs all the way back, then the bisection *can not* point to this revision, and this could not have triggered/exposed this.

Using llc on the .ll file I uploaded does indeed hang quite a ways back.

Using Halide to drive MCJIT , however, is not the same code path, and while I can't explain it, that path *does* trigger a hang starting at this revision.

As mentioned above, it doesn't seem implausible that the changes made here happened to uncover a pre-existing bug that we didn't know about before.

srj added a comment.Wed, Apr 21, 12:07 PM

Commit a511b55cfd67acecc58f1ccf1f3ce5c917dc1d90 fixes both the llc hang and MCJIT-specific hang. Thanks for the attention.