This opens up numerous possibilities for optimizations.
Event Timeline
I agree that we shouldn't have to do anything too fancy here. Ideally, DAGCombiner would take care of this generally, so we're not optimizing this one special-case pattern that includes memcmp while ignoring the general case.
Did you investigate what that would take? We get this in instcombine, so we do have some set of peepholes to potentially copy.
Alternatively, what do you think about making ExpandMemCmp a late IR optimization pass like the vectorizer passes?
When I was looking at memcmp optimizations initially, it was clear that in some larger examples we would benefit from running instcombine/CSE after the expansion. By moving this pass up, we wouldn't have to worry about adding optimizations (like this patch) to the backend, because it would all be done for us by existing IR optimization passes.
Alternatively, what do you think about making ExpandMemCmp a late IR optimization pass like the vectorizer passes?
That would be ideal, but unfortunately ExpandMemCmp requires access to TargetLowering (for TargetLowering::MaxLoadsPerMemcmp, which is consistent with what happens for memcpy and memset). On the other hand, we already have some MemCmpExpansionOptions in TargetTransformInfo, which also knows about memcpy (e.g. getMemcpyLoopLoweringType(), getMemcpyLoopResidualLoweringType() and getMemcpyCost()), so we could move the limit there and turn ExpandMemCmp into a late IR optimization pass. I'll create a patch with this approach.
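For illustration, roughly what such a target-independent option bag could carry once the limit moves out of TargetLowering (the field names below are illustrative, not the exact TargetTransformInfo API):

```
#include "llvm/ADT/SmallVector.h"

// Sketch only: ExpandMemCmp would query something like this through TTI
// instead of reaching into TargetLowering.
struct MemCmpExpansionOptions {
  // Upper bound on the number of loads the expansion may emit, playing the
  // role of TargetLowering::MaxLoadsPerMemcmp (0 means "do not expand").
  unsigned MaxNumLoads = 0;
  // Load widths in bytes the target compares cheaply, largest first,
  // e.g. {8, 4, 2, 1} on a 64-bit target.
  llvm::SmallVector<unsigned, 4> LoadSizes;
};
```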
For reference (and some potential unit and end-to-end test ideas to show wins):
https://bugs.llvm.org/show_bug.cgi?id=36421
https://bugs.llvm.org/show_bug.cgi?id=34032#c13
There is still some test fixing to do for Power, but before I do that I'd like to get your opinion on the approach, in particular regarding the pass placement (I pretty much placed it randomly here).
The nice thing here is that this discovered many more optimization opportunities (e.g. the length2 tests). I attached some benchmark data + fixture to the patch.
There is only one regression for N==24 in the equality case (this can be seen in the length24_eq test).
I've added a test for this one, but note that it still requires -extra-vectorizer-passes because there is no EarlyCSE pass after the function simplification pipeline. I could leave it like this or add one unconditionally, WDYT?
It's next to the memcpy optimization pass, so that seems reasonable to me. But I don't claim any expertise on pass management/placement, so let's add more potential reviewers.
Inline summary for those folks: we'd like to move memcmp expansion from codegen to late in the IR pipeline (still under the control of a target hook) because that can unlock follow-on optimizations for CSE and instcombine. Examples:
https://bugs.llvm.org/show_bug.cgi?id=36421
https://bugs.llvm.org/show_bug.cgi?id=34032#c13
The nice thing here is that this discovered many more optimization opportunities (e.g. the length2 tests). I attached some benchmark data + fixture to the patch.
There is only one regression for N==24 in the equality case (this can be seen in the length24_eq test). I've added a test for this one, but note that it still requires -extra-vectorizer-passes because there is no EarlyCSE pass after the function simplification pipeline. I could leave it like this or add one unconditionally, WDYT?
I haven't seen any complaints about the cost of EarlyCSE, so my initial guess is just add it within addMemcmpPasses(). If there's a way to predicate running it on successful memcmp expansion, that would be nice.
Finally (and this could be a follow-on change), we should make the corresponding change to the new pass manager too, so we don't lose these optimizations when we flip that setting.
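A minimal sketch of what that helper could look like with the legacy PassManagerBuilder (the helper name comes from the discussion above; createExpandMemCmpPass() is assumed here to be the IR pass's creation function):

```
#include "llvm/IR/LegacyPassManager.h"
#include "llvm/Transforms/Scalar.h"

namespace llvm {
class FunctionPass;
// Assumed creation function for the IR-level ExpandMemCmp pass; the real
// name depends on how the pass is moved out of CodeGen.
FunctionPass *createExpandMemCmpPass();
} // namespace llvm

// Sketch, not the committed code: expand memcmp/bcmp calls, then run
// EarlyCSE to clean up the redundant loads the expansion exposes.
static void addMemcmpPasses(llvm::legacy::PassManagerBase &MPM) {
  MPM.add(llvm::createExpandMemCmpPass());
  MPM.add(llvm::createEarlyCSEPass()); // existing EarlyCSE factory
}
```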
llvm/test/CodeGen/X86/memcmp.ll | ||
---|---|---|
0 | It's great to see the end-to-end improvements here in the review, but I think we should not include this in the final commit (assuming we proceed). We should have IR-only regression tests for the passes in question. We could also include 'opt -O2' phase ordering tests within 'test/Transforms/PhaseOrdering/'. To make sure we have the complete opt+llc wins, we could add some small memcmp/bcmp benchmarks to test-suite. (I've never done that, so we can ask others for advice on how to do that.) |
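For concreteness, a phase-ordering test of that kind would typically be driven by a RUN line like this (a sketch; the exact CHECK lines depend on what the -O2 pipeline actually produces):

```
; RUN: opt -O2 -S < %s | FileCheck %s
; A small fixed-length memcmp should be expanded into loads and integer
; compares by the end of the -O2 pipeline, so no call should remain.
; CHECK-NOT: call {{.*}} @memcmp
```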
- Move all tests out of CodeGen.
- Update PowerPC tests.
- Add EarlyCSE to addMemcmpPasses().
The code changes look like what I expected, but I'd still like to hear from at least one other reviewer that we are not violating some high-level principle.
llvm/test/Transforms/ExpandMemCmp/PowerPC/memcmpIR.ll | ||
---|---|---|
69–71 | Why are we generating bswap for a big-endian target? |
llvm/test/Transforms/ExpandMemCmp/PowerPC/memcmpIR.ll | ||
---|---|---|
69–71 | Thanks for the catch. We're not, but unlike llc, opt does not seem to pick up the data layout from the target, so I'm now explicitly specifying the data layout on the RUN line. |
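For reference, a sketch of such a RUN line (the pass flag, triple, and big-endian layout string are assumptions about the test setup, not copied from the patch):

```
; RUN: opt -S -expandmemcmp -mtriple=powerpc64-unknown-linux-gnu \
; RUN:   -data-layout="E-m:e-i64:64-n32:64" < %s | FileCheck %s
```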
The general approach of expanding memcmp in opt makes sense, I think. The placement seems a little on the early side, but that's probably okay given we don't really have interesting optimizations for memcmp calls besides expanding them.
Not sure about the extra EarlyCSE invocation; it's not free.
llvm/test/Other/opt-O2-pipeline.ll | ||
---|---|---|
141 | How hard would it be to preserve the domtree? |
Thanks Eli.
llvm/test/Other/opt-O2-pipeline.ll | ||
---|---|---|
141 | I think it should not be too hard because we merely add blocks in a diamond in the middle of the graph, so the change is quite local. I'll have a look at that. |
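As a sketch of the kind of update this involves (illustrative names, not the patch's exact code), recording the newly inserted diamond edges with DomTreeUpdater keeps the tree valid without a full recompute:

```
#include "llvm/Analysis/DomTreeUpdater.h"
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Dominators.h"

// Illustrative only: after splitting the block and inserting the compare
// block(s) of the memcmp "diamond", tell the updater which CFG edges were
// added so the dominator tree stays consistent.
static void recordDiamondEdges(llvm::DominatorTree &DT,
                               llvm::BasicBlock *Entry,
                               llvm::BasicBlock *LoadCmp,
                               llvm::BasicBlock *Tail) {
  llvm::DomTreeUpdater DTU(DT, llvm::DomTreeUpdater::UpdateStrategy::Lazy);
  DTU.applyUpdates({{llvm::DominatorTree::Insert, Entry, LoadCmp},
                    {llvm::DominatorTree::Insert, Entry, Tail},
                    {llvm::DominatorTree::Insert, LoadCmp, Tail}});
  // Pending updates are flushed when DTU is destroyed (or via flush()).
}
```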
@courbet Please can you ensure you have EXPENSIVE_CHECKS enabled in all your builds before going any further - you've been breaking the buildbots for well over a week now and you still haven't fixed the underlying issue.
Thanks for the suggestion (I can't believe I lived without this for so long) - I could finally reproduce and I have a fix (D62193).
llvm/include/llvm/Analysis/TargetTransformInfo.h | ||
---|---|---|
622–624 | Can we take/use the Function's OptSize attribute as a preliminary step for this patch to reduce the number of diffs? |
llvm/test/Transforms/PhaseOrdering/X86/memcmp.ll | ||
---|---|---|
750–755 | Why/how are we checking x86 asm in an IR transform test file? I don't think there's a good way to do end-to-end testing within the regression test dir right now. We would be better off creating real end-to-end (C source --> x86 asm) tests within test-suite. That way, we can be sure that no passes anywhere in the pipeline are interfering with our memcmp patterns. |
llvm/test/Transforms/PhaseOrdering/X86/memcmp.ll | ||
---|---|---|
750–755 | Right, I think I messed up updating the tests, sorry. The intent was to check for IR here. Will fix. |
LGTM - see inline for a few more nits.
I encourage adding small memcmp tests to test-suite as a follow-up, so we know that things won't break going forward.
I haven't seen the DomTreeUpdater API before now, so I'm assuming the tests are verifying that we made the correct updates (and the earlier rL361239 has survived in trunk).
llvm/lib/Transforms/Scalar/ExpandMemCmp.cpp | ||
---|---|---|
299–302 | Formatting is off here - a line that fits within 80 columns is split, and a line that doesn't fit is not split. | |
419 | Formatting - split line. | |
490 | Formatting - split line. | |
628 | Formatting - split line. | |
llvm/test/Transforms/PhaseOrdering/X86/memcmp.ll | ||
6 | Update stale comment: |
llvm/lib/Transforms/Scalar/ExpandMemCmp.cpp | ||
---|---|---|
245 | Unused parameter? |
I encourage adding small memcmp tests to test-suite as a follow-up, so we know that things won't break going forward.
I'm working on it.
I haven't seen the DomTreeUpdater API before now, so I'm assuming the tests are verifying that we made the correct updates (and the earlier rL361239 has survived in trunk).
That was my first go at it too. Tests do break when I remove the updates, so I'm guessing they do :D
For reference, I had to roll this back as it breaks the sanitizers. Sanitizer passes insert nobuiltin attributes on all memcmp calls to prevent expansion while still being able to intercept them (maybeMarkSanitizerLibraryCallNoBuiltin), and these passes are added last in opt:
PMBuilder.addExtension(PassManagerBuilder::EP_OptimizerLast, addMemorySanitizerPass);
so they now run after ExpandMemCmp.
The fix would be to move addMemcmpPasses(MPM) after addExtensionsToPM(EP_ScalarOptimizerLate, MPM). I'll do this after I get the test-suite benchmarks in, so that we can validate the interactions.
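As context for why the ordering matters, a minimal sketch of the guard the expansion relies on (an assumption about the check, not the actual ExpandMemCmp code):

```
#include "llvm/IR/InstrTypes.h"

// Sketch: a memcmp call carrying the nobuiltin attribute (as added by
// maybeMarkSanitizerLibraryCallNoBuiltin) must be left alone so the
// sanitizer runtime can still intercept it. If ExpandMemCmp runs before
// the sanitizer passes, the call is already gone by the time they mark it.
static bool mayExpandMemCmpCall(const llvm::CallBase &Call) {
  return !Call.isNoBuiltin();
}
```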
test-suite benchmarks: https://reviews.llvm.org/D64082
Full benchmark results for reference (haswell, 10 runs):
The cells highlighted in blue are the statistically significant ones.
I rolled back due to some bot breakages that I could not tell were related or not. I have not come back to it yet; I will.
- Move mem passes after sanitizer passes.
- Move codegen test llvm/test/CodeGen/AArch64/bcmp-inline-small.ll to Transforms/ now that AArch64 expands memcmps.
- Rebase on r371397.
Sorry, this change broke the buildbots: http://lab.llvm.org:8011/builders/clang-x86_64-debian-fast/builds/17395, I reverted it (r371507). Please ensure that you run ninja check-all before committing.
We also detected some compile time regressions in the CTMark subset of the test suite, with lencod regressing by around 4-5%. I haven't fully bisected but the commit list was short and this seemed to be the only suspicious change.
I'm going to abandon it for now. This interferes with the sanitizers in hard-to-fix ways, and I don't have the bandwidth to come back to it. Sorry.