Page MenuHomePhabricator

[Pass Pipeline] Run another round of reassociation after loop pipeline
Needs ReviewPublic

Authored by nemanjai on May 9 2019, 5:38 AM.

Details

Summary

Unrolling can create code that looks a little silly and InstCombine doesn't clean it up. The test case added in this patch ends up with a series of adds in the loop body rather than a shift and add. Reassociation cleans that type of code up, but we don't run it after unrolling.
This patch just adds another round of reassociation after loop unrolling (similarly to what we do with InstCombine).

Performance measurements on PPC show some improvements on a few benchmarks and no noticeable degradations.

Diff Detail

Repository
rL LLVM

Event Timeline

nemanjai created this revision.May 9 2019, 5:38 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 9 2019, 5:38 AM
Whitney added a subscriber: Whitney.May 9 2019, 6:53 AM

The motivation makes sense to me, but someone else should also review this patch in case there's a better way.
We also need to know if there's a compile-time impact. Get data for building test-suite, clang itself, SPEC, or some other benchmarks?

test/Transforms/LoopVectorize/X86/masked_load_store.ll
2–4 ↗(On Diff #198800)

Regardless of anything else, this test file was over-reaching, so I fixed that problem:
rL360340

If you update/rebase, this should not wiggle with this patch now.

test/Transforms/Reassociate/reassociate-after-unroll.ll
1 ↗(On Diff #198800)

This test file belongs in test/Transforms/PhaseOrdering. I prefer to have the baseline test with complete, auto-generated checks (utils/update_test_checks.py) committed as a preliminary step, so we can see the before/after diff in this review.

If you're updating the new pass manager in this patch, this test should have another RUN line to exercise/verify that path.

Running reassociate after unroll probably makes sense. But I'd like to see compile-time numbers.

How carefully have you considered the exact placement? It looks like you're using different placement for the legacy vs. new pass manager. Do we want to run before the late LICM pass?

nemanjai marked 2 inline comments as done.May 9 2019, 12:26 PM

Running reassociate after unroll probably makes sense. But I'd like to see compile-time numbers.

How carefully have you considered the exact placement? It looks like you're using different placement for the legacy vs. new pass manager. Do we want to run before the late LICM pass?

I will collect some compile time numbers with test-suite.
Regarding placement: I didn't really consider it very careful as I don't really know what the tradeoffs would be for various positions. The one thing that seems clear is that it needs to run after unrolling. However, where in the pipeline after unrolling... I am most certainly open to suggestions and can experiment with a few suggested options.

Thanks again for your feedback.

test/Transforms/LoopVectorize/X86/masked_load_store.ll
2–4 ↗(On Diff #198800)

Will do, thank you.

test/Transforms/Reassociate/reassociate-after-unroll.ll
1 ↗(On Diff #198800)

I will move it and add a RUN line for the NPM. Thanks for the suggestions.

nemanjai updated this revision to Diff 199012.May 10 2019, 6:51 AM

Move the newly added test case and update it to only show the different behaviour (after committing it to show the current behaviour in r360426)
Move the additional run of reassociation before the late LICM pass. I assumed that this is a good place for it in the pipeline since LICM might move things out of the loop and potentially take away some opportunities. This is just based on a weak hunch and I am very much open to suggestions for a better place for this in the pipeline.

I have also run CTMark with and without the patch and it shows a minimal increase in compile time. This was run on a quiet PPC (Power9) machine set up for performance measurements with -j1:

Tests: 10
Metric: compile_time

Program                                        results.base results.modified diff
 test-suite :: CTMark/kimwitu++/kc.test         37.41        37.67            0.7%
 test-suite...ark/tramp3d-v4/tramp3d-v4.test    76.74        77.09            0.4%
 test-suite :: CTMark/Bullet/bullet.test        90.14        90.48            0.4%
 test-suite :: CTMark/SPASS/SPASS.test          42.44        42.57            0.3%
 test-suite...:: CTMark/sqlite3/sqlite3.test    45.18        45.30            0.3%
 test-suite...Mark/mafft/pairlocalalign.test    40.60        40.67            0.2%
 test-suite :: CTMark/lencod/lencod.test        63.90        64.01            0.2%
 test-suite...TMark/7zip/7zip-benchmark.test   134.58       134.78            0.2%
 test-suite...-typeset/consumer-typeset.test    35.08        35.05           -0.1%
 test-suite...:: CTMark/ClamAV/clamscan.test    51.71        51.72            0.0%
 Geomean difference                                                           0.3%
       results.base  results.modified       diff
count  10.000000     10.000000         10.000000
mean   61.778200     61.933290         0.002511
std    31.338327     31.402827         0.002218
min    35.080900     35.050700        -0.000861
25%    41.058275     41.143850         0.001559
50%    48.446250     48.509000         0.002115
75%    73.530500     73.816250         0.003576
max    134.579000    134.781900        0.006955

Yeah, I pulled it out. I'm sorry about that. I'm not sure how to make this test case work across all targets. Adding the triple didn't seem to work. I'm looking into how to do this.

Adding the triple didn't seem to work

If you're dealing with certain passes like unrolling, they depend on the target actually being compiled, because we query the target for heuristics. You can write something like "REQUIRES: powerpc-registered-target" if necessary.

nemanjai updated this revision to Diff 199354.May 13 2019, 6:35 PM

Update the new test case. Thanks @efriedma for the tip.

nemanjai updated this revision to Diff 199355.May 13 2019, 6:37 PM

Remove an unrelated change that snuck in by accident.

I have also run CTMark with and without the patch and it shows a minimal increase in compile time. This was run on a quiet PPC (Power9) machine set up for performance measurements with -j1:

Geomean difference                                                           0.3%

It would be interesting to see how that result translates on a more typical x86 build machine. Either way, I suspect we'll get different opinions about whether a 0.3% time increase is minimal and whether that cost is worth paying for the runtime perf gains. This might be a case for differentiating between -O2 and -O3?

It would be interesting to see how that result translates on a more typical x86 build machine. Either way, I suspect we'll get different opinions about whether a 0.3% time increase is minimal and whether that cost is worth paying for the runtime perf gains. This might be a case for differentiating between -O2 and -O3?

I don't really have access to a typical x86 server. I can get the numbers on my laptop but I'm not sure how typical that is. Would that suffice?
Also, I am happy to guard this with an -O3 requirement.

It would be interesting to see how that result translates on a more typical x86 build machine. Either way, I suspect we'll get different opinions about whether a 0.3% time increase is minimal and whether that cost is worth paying for the runtime perf gains. This might be a case for differentiating between -O2 and -O3?

I don't really have access to a typical x86 server. I can get the numbers on my laptop but I'm not sure how typical that is. Would that suffice?

IMO, it's not required for you to gather more data, but some form of x86 is the common case, so that would be a better data point for most people. If we don't get that experiment pre-commit, then I'd expect some x86 bot to flag this change if it's a problem.

Also, I am happy to guard this with an -O3 requirement.

That would remove potential controversy (again, just my opinion) because that's how we limited 'AggressiveInstCombine', but let's see if anyone else (@efriedma @echristo ?) has a different idea.

-O3 makes sense, probably. I mean, reassociate is unlikely to hurt performance, but a second reassociate pass is unlikely to help much unless some pass like unrolling generates new code after the first reassociate.

Can you post the actual performance results? It's hard to judge whether 0.3% cost across the entire compiler is acceptable without knowing the benefits.

fhahn added a subscriber: fhahn.Tue, May 21, 6:06 AM