This is an archive of the discontinued LLVM Phabricator instance.

[x86] try harder to scalarize a vector load with extracted integer op uses
ClosedPublic

Authored by spatel on Jan 27 2022, 8:19 AM.

Details

Summary

extract_vec_elt (load X), C --> scalar load (X+C)

As noted in the comment, DAGCombiner has this fold -- and the code in this patch is adapted from DAGCombiner::scalarizeExtractedVectorLoad() -- but x86 should benefit even if the loaded vector has other uses as long as we apply some other x86-specific conditions. The motivating example from #50310 is shown in vec_int_to_fp.ll.

I'm still looking over the diffs, but they all seem like wins so far.

Diff Detail

Event Timeline

spatel created this revision.Jan 27 2022, 8:19 AM
spatel requested review of this revision.Jan 27 2022, 8:19 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 27 2022, 8:19 AM

Thanks for taking this on!

The loss of the shuffles is my main concern - I don't think EltsFromConsecutiveLoads can help us recover from this yet :(

llvm/test/CodeGen/X86/avx512-shuffles/partial_permute.ll
1725 ↗(On Diff #403661)

yuck

llvm/test/CodeGen/X86/pr45378.ll
80–81

We might be able to get this as well if you can move the fold inside combineExtractWithShuffle ?

spatel marked an inline comment as done.Jan 27 2022, 12:57 PM
spatel added inline comments.
llvm/test/CodeGen/X86/avx512-shuffles/partial_permute.ll
1725 ↗(On Diff #403661)

Yes, this diff suggests we should limit the fold a bit more. In this case, the extracts are immediately converted back to vector via scalar_to_vector or insert_vector_elt, and the sequence doesn't become a shuffle until later.

I suspect there's no 1 right answer about where to draw the line, but this is easy to avoid - we can just make a list of bailout user opcodes (in this 1st draft, it was only ISD::STORE).

spatel updated this revision to Diff 403767.Jan 27 2022, 1:04 PM
spatel marked an inline comment as done.

Patch updated:
Try to avoid a missed shuffle opportunity by bailing out on more user opcodes.

spatel marked an inline comment as done.Jan 27 2022, 1:28 PM
spatel added inline comments.
llvm/test/CodeGen/X86/pr45378.ll
80–81

With only SSE2 (but not the other RUNs), the extract of element 1 is not legal, so it becomes a shuffle before we see it.

We might be able to adjust the isAfterLegalizeDAG predicate and get this, but that could also lead to unintended diffs. I'd prefer to go with the more conservative approach at first, so we don't jeopardize fixing the motivating bug.

pengfei accepted this revision.Jan 27 2022, 5:36 PM

Thanks for the improvement! LGTM.

llvm/lib/Target/X86/X86ISelLowering.cpp
43238–43240

Add a regression test for it?

This revision is now accepted and ready to land.Jan 27 2022, 5:36 PM
RKSimon accepted this revision.Jan 28 2022, 2:19 AM

LGTM - please can you fix the clang-format warnings? Cheers.

spatel marked an inline comment as done.Jan 28 2022, 5:00 AM
spatel added inline comments.
llvm/lib/Target/X86/X86ISelLowering.cpp
43238–43240

There are examples of this in the existing tests diffs - we have both a scalar load and vector load from the same address.
See:
vec_cast.ll -> define <3 x i16> @h(<3 x i32> %a)
vec_int_to_fp.ll -> define <2 x double> @sitofp_load_2i64_to_2f64(<2 x i64> *%a)

We might still be able to do better on those tests, but that demonstrates what I suggested in this comment. Does that make sense? I can try to come up with another case if the existing tests are not clear.

pengfei added inline comments.Jan 28 2022, 5:47 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
43238–43240

I see your point, it's fine. Thank you.

spatel updated this revision to Diff 403992.Jan 28 2022, 5:59 AM

Patch updated:

  1. Fixed formatting (must have an expired flavor of clang-format on my system).
  2. Added dedicated/minimal test to show the extra load vs. register transfer trade-off: @multi_use_load_scalarization
spatel marked 2 inline comments as done.Jan 28 2022, 6:06 AM
spatel added inline comments.
llvm/lib/Target/X86/X86ISelLowering.cpp
43238–43240

You're right, and it's not difficult to add a dedicated test. :)
e9768a2a44a1
The other tests may not show the minimum difference if this transform changes in the future.

This revision was landed with ongoing or failed builds.Jan 28 2022, 7:23 AM
This revision was automatically updated to reflect the committed changes.
spatel marked an inline comment as done.
haowei added a subscriber: haowei.Feb 2 2022, 2:40 PM

Hi, we found out that this patch breaks following tests in the Fuchsia's stage-2 mac clang builders:

Clang :: Refactor/Extract/ExtractExprIntoFunction.cpp
Clang :: Refactor/Extract/ExtractionSemicolonPolicy.cpp
Clang :: Refactor/Extract/ExtractionSemicolonPolicy.m
Clang :: Refactor/Extract/FromMethodToFunction.cpp
Clang :: Refactor/Extract/ObjCProperty.m

An example of the failure: https://ci.chromium.org/ui/p/fuchsia/builders/prod/clang-mac-x64/b8823769353085181809/overview . And test error messages can be found at https://ci.chromium.org/ui/p/fuchsia/builders/prod/clang-mac-x64/b8823769353085181809/test-results . These failures only happens in stage 2 build so it took us some time to bisect and confirm it. Could you take a look please? And if it takes a long time to fix, could you revert this change first please? Thanks.

spatel added a comment.Feb 3 2022, 5:05 AM

Hi, we found out that this patch breaks following tests in the Fuchsia's stage-2 mac clang builders:

Clang :: Refactor/Extract/ExtractExprIntoFunction.cpp
Clang :: Refactor/Extract/ExtractionSemicolonPolicy.cpp
Clang :: Refactor/Extract/ExtractionSemicolonPolicy.m
Clang :: Refactor/Extract/FromMethodToFunction.cpp
Clang :: Refactor/Extract/ObjCProperty.m

An example of the failure: https://ci.chromium.org/ui/p/fuchsia/builders/prod/clang-mac-x64/b8823769353085181809/overview . And test error messages can be found at https://ci.chromium.org/ui/p/fuchsia/builders/prod/clang-mac-x64/b8823769353085181809/test-results . These failures only happens in stage 2 build so it took us some time to bisect and confirm it. Could you take a look please? And if it takes a long time to fix, could you revert this change first please? Thanks.

Thanks for letting me know. I don't have a system set up to debug a stage 2 failure, so this is going to take a while unless someone else has a way to diagnose/reduce the errors more quickly. I will revert within a day if there is no update/progress.

haowei added a comment.Feb 3 2022, 7:02 PM

Hi, we found out that this patch breaks following tests in the Fuchsia's stage-2 mac clang builders:

Clang :: Refactor/Extract/ExtractExprIntoFunction.cpp
Clang :: Refactor/Extract/ExtractionSemicolonPolicy.cpp
Clang :: Refactor/Extract/ExtractionSemicolonPolicy.m
Clang :: Refactor/Extract/FromMethodToFunction.cpp
Clang :: Refactor/Extract/ObjCProperty.m

An example of the failure: https://ci.chromium.org/ui/p/fuchsia/builders/prod/clang-mac-x64/b8823769353085181809/overview . And test error messages can be found at https://ci.chromium.org/ui/p/fuchsia/builders/prod/clang-mac-x64/b8823769353085181809/test-results . These failures only happens in stage 2 build so it took us some time to bisect and confirm it. Could you take a look please? And if it takes a long time to fix, could you revert this change first please? Thanks.

Thanks for letting me know. I don't have a system set up to debug a stage 2 failure, so this is going to take a while unless someone else has a way to diagnose/reduce the errors more quickly. I will revert within a day if there is no update/progress.

I am not an expert on Mac. I did some experiments. Using the clang built from this revision (b4b97ec813a02585000f30ac7d532dda74e8bfda) to build stage 2 clang, regardless of the revision of the second stage clang, will cause clang-refactor related test to fail. So I think this patch introduced a bug that changed clang's code gen behavior and cause clang to generate buggy code on the Mac. I still don't quite understand why the clang-refactor tool is the only one affected so far. If you need any binaries from stage 1 and 2 build to debug this issue, please let me know. I can help build them for you.

spatel added a comment.Feb 4 2022, 5:06 AM

I am not an expert on Mac. I did some experiments. Using the clang built from this revision (b4b97ec813a02585000f30ac7d532dda74e8bfda) to build stage 2 clang, regardless of the revision of the second stage clang, will cause clang-refactor related test to fail. So I think this patch introduced a bug that changed clang's code gen behavior and cause clang to generate buggy code on the Mac. I still don't quite understand why the clang-refactor tool is the only one affected so far. If you need any binaries from stage 1 and 2 build to debug this issue, please let me know. I can help build them for you.

Thanks - I'll take any help I can get to reduce the bug. :)
I have reverted the patch on main, and I think I should also revert it on the 14 branch (if you can confirm that the bug is present resolved with that revert).

Is it correct that the bug is only visible on the Mac build (the same tests are not failing on other platforms)?
It's not clear to me from the logs what the failure is: all 5 of the tests in "clang/test/Refactor/Extract" cause clang-refactor to crash?

dyung added a subscriber: dyung.Feb 4 2022, 11:22 AM

Just a heads up that we potentially saw an issue with this commit on some of our internal tests. I've bisected it down to your commit, but I need to make sure it reproduces with an opensource compiler (without our private changes), and if so I will post a bug with the information.

Just a heads up that we potentially saw an issue with this commit on some of our internal tests. I've bisected it down to your commit, but I need to make sure it reproduces with an opensource compiler (without our private changes), and if so I will post a bug with the information.

Thanks! That would be great. I'm guessing this failed to account for some combination of load/store ordering, but I can't see the bug yet.

lebedev.ri added inline comments.
llvm/lib/Target/X86/X86ISelLowering.cpp
43260

I think i don't quite understand what is going on here,
but i really don't understand why we'd want to
replace SDValue(LoadVec, 1) (which is, i guess, LoadVec->getChain()?)
with the new Chain?
We only simply replace the extract with a new load,
why could we possibly invalidate the other memory chains?

This *reeks*. (Yes, this is copied from DAGCombiner::scalarizeExtractedVectorLoad())

haowei added a comment.Feb 4 2022, 2:49 PM

I am not an expert on Mac. I did some experiments. Using the clang built from this revision (b4b97ec813a02585000f30ac7d532dda74e8bfda) to build stage 2 clang, regardless of the revision of the second stage clang, will cause clang-refactor related test to fail. So I think this patch introduced a bug that changed clang's code gen behavior and cause clang to generate buggy code on the Mac. I still don't quite understand why the clang-refactor tool is the only one affected so far. If you need any binaries from stage 1 and 2 build to debug this issue, please let me know. I can help build them for you.

Thanks - I'll take any help I can get to reduce the bug. :)
I have reverted the patch on main, and I think I should also revert it on the 14 branch (if you can confirm that the bug is present resolved with that revert).

Is it correct that the bug is only visible on the Mac build (the same tests are not failing on other platforms)?
It's not clear to me from the logs what the failure is: all 5 of the tests in "clang/test/Refactor/Extract" cause clang-refactor to crash?

I can confirm the stage 2 builds are green after the revert. I am doing some local builds to see if I can generate stage 2 clang-refactor binaries with and without your patch so I can hand them to you for comparison. I am also working on simpler steps to reproduce the issue without Fuchsia related stuff to it can be easier to debug. Will file a bug in github once I get there.

efriedma added inline comments.
llvm/lib/Target/X86/X86ISelLowering.cpp
43260

LoadVec->getChain() is the chain operand; SDValue(LoadVec, 1) is the chain result.

I suspect the issue here is that the chains aren't connected correctly. The current code RAUW's the chain output of LoadVec with the chain output of Load... and leaves the chain of LoadVec dangling. If LoadVec isn't dead, this is wrong; the chain result of LoadVec needs to stay connected.

I was able to reproduce the issue with a small testcase that I have filed as issue 53695. Hope that helps.

I was able to reproduce the issue with a small testcase that I have filed as issue 53695. Hope that helps.

That was very helpful - thanks!
If I'm seeing it correctly, that example shows that the bug fix for this patch requires using 'DAG.makeEquivalentMemoryOrdering(OriginalLoad, Load)'.
But the test example can also be modified slightly to expose a bug in the DAGCombiner code that this patch copied from. Ie, there's a visible miscompile even with this patch theoretically fixed.
I haven't been able to come up with a test that exposes the bug independently of applying this patch yet though.

spatel reopened this revision.Feb 11 2022, 9:32 AM
This revision is now accepted and ready to land.Feb 11 2022, 9:32 AM
spatel updated this revision to Diff 407920.Feb 11 2022, 9:36 AM

Patch updated:

  1. Use makeEquivalentMemoryOrdering (see D119549 for a similar change to the DAGCombiner code)
  2. Added a regression test based on https://github.com/llvm/llvm-project/issues/53695 (with the old version of this patch, this test will fail).
spatel requested review of this revision.Feb 11 2022, 9:36 AM
spatel marked 2 inline comments as done.

The latest patch no longer causes clang-refactor test failures on stage2 mac build on Fuchsia builders.

The latest patch no longer causes clang-refactor test failures on stage2 mac build on Fuchsia builders.

Great - thank you for checking!

llvm/test/CodeGen/X86/extractelement-load.ll
341–345

This is the new test based on @dyung 's reduction. In the old version of this patch, a movl load from @zero was placed after the vmovaps %ymm0, zero(%rip)

RKSimon accepted this revision.Feb 12 2022, 9:49 AM

LGTM - cheers

This revision is now accepted and ready to land.Feb 12 2022, 9:49 AM
This revision was landed with ongoing or failed builds.Feb 13 2022, 5:34 AM
This revision was automatically updated to reflect the committed changes.

We are noticing some failures in internal unit tests with this change.

Tests pass at 588f121ada6d541 but start failing at b4b97ec813a0258.

Herald added a project: Restricted Project. · View Herald TranscriptMar 4 2022, 8:17 AM

We are noticing some failures in internal unit tests with this change.

Tests pass at 588f121ada6d541 but start failing at b4b97ec813a0258.

Oh, the patch was already reverted at 7b037250978. I'll test it again and report back if there are issues.

We are noticing some failures in internal unit tests with this change.

Tests pass at 588f121ada6d541 but start failing at b4b97ec813a0258.

Oh, the patch was already reverted at 7b037250978. I'll test it again and report back if there are issues.

@manojgupta Please can you confirm that everything is OK after the patch was recommitted at rGc486b82cfbe59929a80e5f29bab82112555c8bf4