This is an archive of the discontinued LLVM Phabricator instance.

[ExpandMemCmp][MergeICmps] Move passes out of CodeGen into opt pipeline.
AbandonedPublic

Authored by courbet on Apr 5 2019, 8:05 AM.

Details

Summary

This opens up numerous possibilities for optimizations.

Event Timeline

courbet created this revision.Apr 5 2019, 8:05 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 5 2019, 8:05 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
courbet edited the summary of this revision. (Show Details)Apr 5 2019, 8:08 AM

I agree that we shouldn't have to do anything too fancy here. Ideally, DAGCombiner would take care of this generally, so we're not optimizing this 1 special-case pattern that includes memcmp and ignoring the general case.

Did you investigate what that would take? We get this in instcombine, so we do have some set of peepholes to potentially copy.

Alternatively, what do you think about making ExpandMemCmp a late IR optimization pass like the vectorizer passes?
When I was looking at memcmp optimizations initially, it was clear that in some larger examples we would benefit from running instcombine/CSE after the expansion. By moving this pass up, we wouldn't have to worry about adding optimizations (like this patch) to the backend because it will all be done for us in existing IR optimization passes.

Alternatively, what do you think about making ExpandMemCmp a late IR optimization pass like the vectorizer passes?

The would be ideal, but unfortunately ExpandMemCmp requires access to the TargetLowering (for TargetLowering::MaxLoadsPerMemcmp which is consistent with what happens for memcpy and memset) . On the other hand, we already have some MemCmpExpansionOptions in TargetTransformInfo, which also knows about memcpy (e.g. getMemcpyLoopLoweringType(), getMemcpyLoopResidualLoweringType and getMemcpyCost), so we could move it here and move ExpandMemCmp to a late IR opt pass. I'll create a patch with this approach.

For reference (and some potential unit and end-to-end test ideas to show wins):
https://bugs.llvm.org/show_bug.cgi?id=36421
https://bugs.llvm.org/show_bug.cgi?id=34032#c13

courbet updated this revision to Diff 194696.Apr 11 2019, 8:20 AM

[ExpandMemCmp][MergeICmps] Move passes out of CodeGen into opt pipeline.

There is still some test fixing to do for Power, but before I do that I'd like to get your opinion on the approach, in particular regarding the pass placement (I pretty much placed it randomly here).

The nice thing here is that this discovered many more optimization opportunities (e.g. the length2 tests). I attached some benchmark data + fixture to the patch.
There is only one regression for N==24 in the equality case (this can be seen in the length24_eq test).

https://bugs.llvm.org/show_bug.cgi?id=36421

I've added a test for this one, but note that it still requires -extra-vectorizer-passes because there is not EarlyCSE pass after the function simplification pipeline. I could leave it like this or add one non-optionally, WDYT ?

Benchmark results:

Benchmark fixture:

courbet retitled this revision from [ExpandMemCmp] Improve generated code for simple non-equality compares. to [ExpandMemCmp][MergeICmps] Move passes out of CodeGen into opt pipeline..Apr 11 2019, 8:32 AM
courbet edited the summary of this revision. (Show Details)

There is still some test fixing to do for Power, but before I do that I'd like to get your opinion on the approach, in particular regarding the pass placement (I pretty much placed it randomly here).

It's next to the memcpy optimization pass, so that seems reasonable to me. But I don't claim any expertise on pass management/placement, so let's add more potential reviewers.
Inline summary for those folks: we'd like to move memcmp expansion from codegen to late in the IR pipeline (still under the control of a target hook) because that can unlock follow-on optimizations for CSE and instcombine. Examples:
https://bugs.llvm.org/show_bug.cgi?id=36421
https://bugs.llvm.org/show_bug.cgi?id=34032#c13

The nice thing here is that this discovered many more optimization opportunities (e.g. the length2 tests). I attached some benchmark data + fixture to the patch.
There is only one regression for N==24 in the equality case (this can be seen in the length24_eq test).

I've added a test for this one, but note that it still requires -extra-vectorizer-passes because there is not EarlyCSE pass after the function simplification pipeline. I could leave it like this or add one non-optionally, WDYT ?

I haven't seen any complaints about the cost of EarlyCSE, so my initial guess is just add it within addMemcmpPasses(). If there's a way to predicate running it on successful memcmp expansion, that would be nice.

Finally (and this could be a follow-on change), we should make the corresponding change to the new pass manager too, so we don't lose these optimizations when we flip that setting.

llvm/test/CodeGen/X86/memcmp.ll
1

It's great to see the end-to-end improvements here in the review, but I think we should not include this in the final commit (assuming we proceed).

We should have IR-only regression tests for the passes in question. We could also include 'opt -O2' phase ordering tests within 'test/Transforms/PhaseOrdering/'.

To make sure we have the complete opt+llc wins, we could add some small memcmp/bcmp benchmarks to test-suite. (I've never done that, so we can ask others for advice on how to do that.)

courbet updated this revision to Diff 196257.Apr 23 2019, 8:22 AM
  • Move all tests out of CodeGen.
  • Update PowerPC tests.
  • Add EarlyCSE to addMemcmpPasses().
spatel added a comment.May 6 2019, 6:17 AM

The code changes look like what I expected, but I'd still like to hear from at least 1 other reviewer that we are not violating some high-level principle.

llvm/test/Transforms/ExpandMemCmp/PowerPC/memcmpIR.ll
69–71

Why are we generating bswap for a big-endian target?

courbet updated this revision to Diff 198281.May 6 2019, 8:03 AM
courbet marked an inline comment as done.

Fixing RUN lines for PowerPC opt tests.

courbet added inline comments.May 6 2019, 8:09 AM
llvm/test/Transforms/ExpandMemCmp/PowerPC/memcmpIR.ll
69–71

Thanks for the catch. We're not, but contrary to llc, opt does not seem to get the data layout from the target, so I/m now explicitly specifying the data layout on the RUN line.

The general approach of expanding memcmp in opt makes sense, I think. The placement seems a little on the early side, but that's probably okay given we don't really have interesting optimizations for memcmp calls besides expanding them.

Not sure about the extra EarlyCSE invocation; it's not free.

llvm/test/Other/opt-O2-pipeline.ll
142

How hard would it be to preserve the domtree?

courbet planned changes to this revision.May 6 2019, 11:37 PM
courbet marked an inline comment as done.

Thanks Eli.

llvm/test/Other/opt-O2-pipeline.ll
142

I think it should not be too hard because we merely add blocks in a diamond in the middle of the graph, so the change is quite local. I'll have a look at that.

RKSimon added a subscriber: RKSimon.

@courbet Please can you ensure you have EXPENSIVE_CHECKS enabled in all your builds before going any further - you've been breaking the buildbots for well over a week now and you still haven't fixed the underlying issue.

@courbet Please can you ensure you have EXPENSIVE_CHECKS enabled in all your builds before going any further - you've been breaking the buildbots for well over a week now and you still haven't fixed the underlying issue.

Thanks for the suggestion (I can't believe I lived without this for so long) - I could finally reproduce and I have a fix (D62193).

courbet updated this revision to Diff 201689.May 28 2019, 9:07 AM
courbet marked an inline comment as done.

Make ExpandMemCmp preserve the DomTree.

ychen added a subscriber: ychen.Jun 20 2019, 3:52 PM

Ping. Any other concerns ?

spatel added inline comments.Jun 24 2019, 10:12 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
622–623 ↗(On Diff #201689)

Can we take/use the Function's OptSize attribute as a preliminary step for this patch to reduce the number of diffs?

courbet updated this revision to Diff 206390.Jun 25 2019, 1:46 AM
courbet marked an inline comment as done.

Split off options refactoring to r364281, rebase.

spatel added inline comments.Jun 25 2019, 6:49 AM
llvm/test/Transforms/PhaseOrdering/X86/memcmp.ll
750–755

Why/how are we checking x86 asm in an IR transform test file?

I don't think there's a good way to do end-to-end testing now within the regression test dir.

We would be better off creating real end-to-end (C source --> x86 asm) tests within test-suite? That way, we can be sure that no passes anywhere in the pipeline are interfering with our memcmp patterns.

courbet marked an inline comment as done.Jun 25 2019, 7:13 AM
courbet added inline comments.
llvm/test/Transforms/PhaseOrdering/X86/memcmp.ll
750–755

Right, I think I messed up updating the tests, sorry. The intent was to check for IR here. Will fix.

courbet updated this revision to Diff 206445.Jun 25 2019, 7:46 AM

Remove asm testing.

spatel accepted this revision.Jun 25 2019, 8:22 AM

LGTM - see inline for a few more nits.

I encourage adding small memcmp tests to test-suite as a follow-up, so we know that things won't break going forward.

I haven't seen the DomTreeUpdater API before now, so I'm assuming the tests are verifying that we made the correct updates (and the earlier rL361239 has survived in trunk).

llvm/lib/Transforms/Scalar/ExpandMemCmp.cpp
308–314

Formatting is off here - line that fits 80-col is split, and line that doesn't fit is not split.

435

Formatting - split line.

509

Formatting - split line.

652

Formatting - split line.

llvm/test/Transforms/PhaseOrdering/X86/memcmp.ll
6

Update stale comment:
'This tests interaction between the MergeICmp and ExpandMemCmp IR transform passes.'

This revision is now accepted and ready to land.Jun 25 2019, 8:22 AM
xbolva00 added inline comments.
llvm/lib/Transforms/Scalar/ExpandMemCmp.cpp
252–253

Unused parameter?

courbet updated this revision to Diff 206614.Jun 26 2019, 2:14 AM

Rebase on r364384.

courbet marked 6 inline comments as done.Jun 26 2019, 2:19 AM

I encourage adding small memcmp tests to test-suite as a follow-up, so we know that things won't break going forward.

I'm working on it.

I haven't seen the DomTreeUpdater API before now, so I'm assuming the tests are verifying that we made the correct updates (and the earlier rL361239 has survived in trunk).

That was my first go at it too. Tests do break when I remove the updates, so I'm guessing they do :D

courbet updated this revision to Diff 206629.Jun 26 2019, 4:44 AM

clang-format

For information, I had to roll this back at it breaks sanitizers. Sanitizer passes insert some nobuiltin attributes on all memcmp calls to prevent expansion and still be able to intercept them (maybeMarkSanitizerLibraryCallNoBuiltin), and these passes are added last in opt:

PMBuilder.addExtension(PassManagerBuilder::EP_OptimizerLast, addMemorySanitizerPass);

so they now run after ExpandMemCmp.

The fix would be to move addMemcmpPasses(MPM); after addExtensionsToPM(EP_ScalarOptimizerLate, MPM);. I'll do this after I get the test-suite benchmarks in, so that we can validate the interactions.

test-suite benchmarks: https://reviews.llvm.org/D64082

Full benchmark results for reference (haswell, 10 runs):

The cells highlighted in blue are the statistically significant ones.

@courbet What's happening with this patch?

I rolled back due to some bot breakages that I was not able to see were related or not. I have not come back to it yet, will do.

RKSimon requested changes to this revision.Sep 6 2019, 7:45 AM

reopening until the regressions have been investigated

This revision now requires changes to proceed.Sep 6 2019, 7:45 AM
courbet updated this revision to Diff 219339.Sep 9 2019, 6:12 AM
  • Move mem passes after sanitizer passes.
  • Move codegen test llvm/test/CodeGen/AArch64/bcmp-inline-small.ll to Transforms/ now that AArch64 expands memcmps.
  • Rebase on r371397.

Sorry, this change broke the buildbots: http://lab.llvm.org:8011/builders/clang-x86_64-debian-fast/builds/17395, I reverted it (r371507). Please ensure that you run ninja check-all before committing.

We also detected some compile time regressions in the CTMark subset of the test suite, with lencod regressing by around 4-5%. I haven't fully bisected but the commit list was short and this seemed to be the only suspicious change.

nikic added a subscriber: nikic.Oct 23 2019, 12:07 PM

@courbet what's happening with this patch?

courbet abandoned this revision.Nov 22 2019, 2:43 AM

@courbet what's happening with this patch?

I'm going to abandon it for now. This interferes with the sanitizers in hard to fix ways, and I don't have the bandwith to come back to it. sorry.

llvm/test/CodeGen/X86/O3-pipeline.ll