This is an archive of the discontinued LLVM Phabricator instance.

Put opt-viewer critical items in parallel
ClosedPublic

Authored by bcain on Nov 22 2016, 5:30 AM.

Details

Summary

Put opt-viewer critical items in parallel

Requires features from Python 2.7
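At a high level, the change fans the per-file YAML parsing out across a pool of worker processes. Below is a minimal sketch of that structure, with illustrative names only; the real script also registers custom YAML tags for the remark types:

    # Sketch of the parallel structure, not the actual patch.
    import multiprocessing

    import yaml
    try:
        # libYAML-backed C parser, when PyYAML was built against it
        from yaml import CLoader as Loader
    except ImportError:
        from yaml import Loader

    def load_remarks(path):
        # Worker: parse one YAML file in a child process.
        with open(path) as stream:
            return list(yaml.load_all(stream, Loader=Loader))

    if __name__ == '__main__':
        inputs = ['a.opt.yaml', 'b.opt.yaml']       # hypothetical inputs
        pool = multiprocessing.Pool(multiprocessing.cpu_count())
        all_docs = pool.map(load_remarks, inputs)   # one result list per file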

Performance
Below are performance results across various configurations, taken on an i5-5200U (dual core + HT) with a small subset of the YAML output of building Python 3.6.0b3 with LTO+PGO: 60 YAML files.

"multiprocessing" is the current submission contents. "baseline" is as of 544f14c6b2a07a94168df31833dba9dc35fd8289 (I think this is aka r287505).

"ImportError" vs "class<...CLoader>" below are just confirming the expected configuration (with/without CLoader).

The results below were measured on an AMD A8-5500B (4 cores) with 224 input YAML files, showing a ~1.75x speedup over the baseline with libYAML. I suspect it would scale well on high-end servers.

**************************************** MULTIPROCESSING ****************************************
PyYAML:
        Traceback (most recent call last):
          File "<string>", line 1, in <module>
        ImportError: cannot import name CLoader
        Python 2.7.10
489.42user 5.53system 2:38.03elapsed 313%CPU (0avgtext+0avgdata 400308maxresident)k
0inputs+31392outputs (0major+473540minor)pagefaults 0swaps

PyYAML+libYAML:
        <class 'yaml.cyaml.CLoader'>
        Python 2.7.10
78.69user 5.45system 0:32.63elapsed 257%CPU (0avgtext+0avgdata 398560maxresident)k
0inputs+31392outputs (0major+542022minor)pagefaults 0swaps

PyPy/PyYAML:
        Traceback (most recent call last):
          File "<builtin>/app_main.py", line 75, in run_toplevel
          File "<builtin>/app_main.py", line 601, in run_it
          File "<string>", line 1, in <module>
        ImportError: cannot import name 'CLoader'
        Python 2.7.9 (2.6.0+dfsg-3, Jul 04 2015, 05:43:17)
        [PyPy 2.6.0 with GCC 4.9.3]
154.27user 8.12system 0:53.83elapsed 301%CPU (0avgtext+0avgdata 627960maxresident)k
808inputs+30376outputs (0major+727994minor)pagefaults 0swaps
**************************************** BASELINE        ****************************************
PyYAML:
        Traceback (most recent call last):
          File "<string>", line 1, in <module>
        ImportError: cannot import name CLoader
        Python 2.7.10
358.08user 4.05system 6:08.37elapsed 98%CPU (0avgtext+0avgdata 315004maxresident)k
0inputs+31392outputs (0major+85252minor)pagefaults 0swaps

PyYAML+libYAML:
        <class 'yaml.cyaml.CLoader'>
        Python 2.7.10
50.32user 3.30system 0:56.59elapsed 94%CPU (0avgtext+0avgdata 307296maxresident)k
0inputs+31392outputs (0major+79335minor)pagefaults 0swaps

PyPy/PyYAML:
        Traceback (most recent call last):
          File "<builtin>/app_main.py", line 75, in run_toplevel
          File "<builtin>/app_main.py", line 601, in run_it
          File "<string>", line 1, in <module>
        ImportError: cannot import name 'CLoader'
        Python 2.7.9 (2.6.0+dfsg-3, Jul 04 2015, 05:43:17)
        [PyPy 2.6.0 with GCC 4.9.3]
72.94user 5.18system 1:23.41elapsed 93%CPU (0avgtext+0avgdata 455312maxresident)k
0inputs+30392outputs (0major+110280minor)pagefaults 0swaps

Diff Detail

Repository
rL LLVM

Event Timeline

bcain updated this revision to Diff 78858.Nov 22 2016, 5:30 AM
bcain retitled this revision from to Put opt-viewer critical items in parallel.
bcain updated this object.
bcain added reviewers: anemet, fhahn.
bcain set the repository for this revision to rL LLVM.
bcain added a subscriber: llvm-commits.
fhahn edited edge metadata.Nov 22 2016, 5:38 AM

Is baseline with PyYAML+libYAML just slightly slower than multiprocessing with PyYAML+libYAML (0:12.57elapsed vs 0:09.66elapsed) or am I missing something?

bcain updated this revision to Diff 78860.Nov 22 2016, 5:49 AM
bcain edited edge metadata.

Re-generated the diff with maximum context.

bcain added a comment.Nov 22 2016, 5:52 AM

Is baseline with PyYAML+libYAML just slightly slower than multiprocessing with PyYAML+libYAML (0:12.57elapsed vs 0:09.66elapsed) or am I missing something?

Sorry, it was a bad case to present. Only 60 input files and minimal parallelism (only 4 tasks spawned). It will scale significantly better on many-core servers + hundreds of input files.

fhahn added a comment.Nov 22 2016, 6:00 AM

I see. I think it would be good to present a case where multiprocessing provides a good speedup compared to the baseline PyYAML+libYAML, which alone seems to yield a 10x speedup vs. without PyYAML+libYAML.

bcain updated this object.Nov 23 2016, 5:41 AM

I see. I think it would be good to present a case where multiprocessing provides a good speedup compared to the baseline PyYAML+libYAML, which alone seems to yield a 10x speedup vs. without PyYAML+libYAML.

Updated the summary to show the unabridged YAML from Python 3.6.0b3 on a 4-core desktop. It's a 1.75x speedup over the existing PyYAML+libYAML.

fhahn added a comment.Nov 23 2016, 4:00 PM

Here are some numbers I gathered for analyzing 250 opt.yaml files generated by compiling clang/llvm with the following options: "-O2 -g -fsave-optimization-record -mllvm -pass-remarks -mllvm -pass-remarks-missed". The size of the .opt.yaml files varies between a few kB and a few MB.

I used Python 2.7.6 on Linux with PyYAML and libyaml on an Intel Xeon with a large number of cores (runtime measured with the time command):

baseline:
./opt-viewer-base.py `cat yamls_250.txt` `pwd`/out-baseline  257.55s user 86.05s system 91% cpu 6:15.89 total

parallel with 10 processes:
./opt-viewer.py `cat yamls_250.txt` `pwd`/out-parallel-10  607.80s user 268.75s system 160% cpu 9:04.52 total

parallel with 4 processes:
./opt-viewer.py `cat yamls_250.txt` `pwd`/out-parallel-4  524.31s user 196.45s system 163% cpu 7:20.54 total

I'm not sure if I did something wrong. How big were the opt.yaml files you used?

bcain added a comment.Nov 23 2016, 5:07 PM

Here are some numbers I gathered for analyzing 250 opt.yaml files generated by compiling clang/llvm with the following options: "-O2 -g -fsave-optimization-record -mllvm -pass-remarks -mllvm -pass-remarks-missed". The size of the .opt.yaml files varies between a few kB and a few MB.

I used Python 2.7.6 on Linux with PyYAML and libyaml on an Intel Xeon with a large number of cores (runtime measured with the time command):

Hmm, that's kinda surprising. Maybe the overhead of the Lock() is bigger when there's more contention. Could you artificially limit the parallelism (min(cpu_count(), 4), or via cgroups/taskset)? Or comment out the Lock?
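Capping the worker count could look roughly like this (a sketch, not part of the patch):

    # Sketch: spawn at most 4 workers regardless of core count.
    from multiprocessing import Pool, cpu_count

    pool = Pool(processes=min(cpu_count(), 4))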

It will be mildly more difficult to run libYAML on the servers I have access to, but I will give it a go.

I used the Python 3.6.0b3 source and only "-fsave-optimization-record": 224 files, ranging from ~10 kB up to 3.4 MB.

anemet edited edge metadata.Nov 28 2016, 9:20 AM

Here are some numbers I gathered for analyzing 250 opt.yaml files generated by compiling clang/llvm with the following options: "-O2 -g -fsave-optimization-record -mllvm -pass-remarks -mllvm -pass-remarks-missed". The size of the .opt.yaml files varies between a few kB and a few MB.

There is no need to pass -pass-remarks*; -fsave-optimization-record gathers all remarks.

bcain added a comment.Nov 29 2016, 6:05 PM

Here are some numbers I gathered for analyzing 250 opt.yaml files generated by compiling clang/llvm with the following options: "-O2 -g -fsave-optimization-record -mllvm -pass-remarks -mllvm -pass-remarks-missed". The size of the .opt.yaml files varies between a few kB and a few MB.

...

I'm not sure if I did something wrong. How big were the opt.yaml files you used?

I wasn't able to reproduce these results on an i7 (4 cores x 2 threads) machine either. I switched the test input from CPython to Apache Thrift; it's a C++ project, so it exercises c++filt more. As I varied the number of processes, the time either improved or stayed neutral. I tried with and without -source-dir because I noticed your experiments omitted it. Nothing I measured was ever slower than the baseline.

Florian, can you upload your specific subset of the LLVM .yaml files somewhere? That way I can try to isolate whether the problem is specific to that test case or to something else. If it's convenient, can you run an experiment with the YAML files on a tmpfs partition like /dev/shm and see if this patch is still slower than the baseline?

Hi Brian,

I will also do some experiments with this patch but here are my comments on the patch itself. In general feel free to peel off your formatting changes and just commit them right away.

Adam

utils/opt-viewer/opt-viewer.py
60–61 ↗(On Diff #78860)

Please commit formatting changes separately.

68–69 ↗(On Diff #78860)

This too, please commit separately.

134–143 ↗(On Diff #78860)

Looks like another formatting change.

203 ↗(On Diff #78860)

I am still not sure that this is the right thing to do but either way it should go into a separate review.

221–222 ↗(On Diff #78860)

Formatting-only change.

346 ↗(On Diff #78860)

merge_file_remark_dicts

351 ↗(On Diff #78860)

k -> file
k_ -> line
v_ -> remarks

378–396 ↗(On Diff #78860)

Since you're passing all_remarks, it's confusing that you're not passing file_remarks.

bcain added a comment.Nov 30 2016, 3:55 PM

I'll omit the formatting changes from autopep8 and the encoding change, and fix the exception-handling bug in _get_remarks() and the render_file() file_remarks/all_remarks bug.

utils/opt-viewer/opt-viewer.py
337 ↗(On Diff #78860)

This is not how this failure should get reported.

378–396 ↗(On Diff #78860)

Yes, this looks to me like a bug.

anemet added inline comments.Dec 1 2016, 5:27 PM
utils/opt-viewer/opt-viewer.py
203 ↗(On Diff #78860)

Actually, I think you're right about the encoding. Feel free to commit this part. Thanks!

I see consistent speed-ups but the resulting HTML directories don't match the original. It *seems* there are some extra remarks so I am assuming the uniquing does not work? Can you please look into it? After that I think this is ready to go in.

Also one more high-level comment. We should probably expose the level of parallelism through a '-j' option defaulting to cpu_count().
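A sketch of wiring up such an option (the patch's actual flag handling may differ):

    # Sketch of a '-j'/'--jobs' option defaulting to the machine's core count.
    import argparse
    import multiprocessing

    parser = argparse.ArgumentParser(description='opt-viewer')
    parser.add_argument('-j', '--jobs', type=int,
                        default=multiprocessing.cpu_count(),
                        help='number of worker processes')
    args = parser.parse_args()
    pool = multiprocessing.Pool(processes=args.jobs)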

bcain updated this revision to Diff 80374.Dec 5 2016, 7:06 PM
bcain edited edge metadata.

Removed formatting and other unrelated changes.

Added "--jobs".

Fixed bug in passing file_remarks/all_remarks

bcain marked 6 inline comments as done.Dec 5 2016, 8:16 PM

Response: most changes made as requested.

utils/opt-viewer/opt-viewer.py
346 ↗(On Diff #78860)

I left this as-is. It's designed to be an abstract merge among an iterable of dictionaries.

[{'a': [3]}, {'a': [4]}, {'b': [6]}] -> {'a': [3, 4], 'b': [6]}

If you feel it would be better with the specific concepts we're leveraging here, I'll change it.
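For reference, a minimal sketch of that abstract merge, assuming dictionaries with list values (not necessarily the patch's exact implementation):

    # Sketch: merge an iterable of dicts with list values, concatenating
    # the lists wherever keys collide.
    from collections import defaultdict

    def merge_dicts(dicts):
        merged = defaultdict(list)
        for d in dicts:
            for key, values in d.iteritems():   # Python 2
                merged[key].extend(values)
        return dict(merged)

    # merge_dicts([{'a': [3]}, {'a': [4]}, {'b': [6]}])
    # -> {'a': [3, 4], 'b': [6]}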

anemet added inline comments.Dec 5 2016, 8:24 PM
utils/opt-viewer/opt-viewer.py
346 ↗(On Diff #78860)

Sounds good, just add a comment at the call then that this is merging the list of remarks at each line of each file.

anemet added a comment.Dec 7 2016, 9:18 AM

Brian, just to make sure we're not deadlocked here waiting for each other. I am still looking for an answer to this question:

I see consistent speed-ups but the resulting HTML directories don't match the original. It *seems* there are some extra remarks so I am assuming the uniquing does not work? Can you please look into it? After that I think this is ready to go in.

bcain added a comment.Dec 7 2016, 6:20 PM

Brian, just to make sure we're not deadlocked here waiting for each other. I am still looking for an answer to this question:

I see consistent speed-ups but the resulting HTML directories don't match the original. It *seems* there are some extra remarks so I am assuming the uniquing does not work? Can you please look into it? After that I think this is ready to go in.

I'd hoped that might go away when I fixed the args getting passed to generate_report().

I'll run some tests to see if I can reproduce the problem.

bcain updated this revision to Diff 80710.Dec 7 2016, 6:55 PM
bcain marked 7 inline comments as done.

Updated comments for merge_dicts().

Adam, can you provide a test case that illustrates the problem? I didn't see it with the last two revisions.

anemet added a comment.Dec 8 2016, 2:50 PM

Adam, can you provide a test case that illustrates the problem? I didn't see it with the last two revisions.

I just sent you an email with the details.

Thanks,
Adam

bcain added a comment.Jan 19 2017, 7:05 AM

Adam, can you provide a test case that illustrates the problem? I didn't see it with the last two revisions.

I just sent you an email with the details.

The differences are limited to the index.html. Most of the differences are inconsequential. I'm trying to explain them one by one and a few more unexplained ones remain, so stay tuned.

The inconsequential ones are due to ordering differences in the index.html between entries with identical hotness. I can make those differences go away if I change both the baseline and the patch to have this line:

sorted_remarks = sorted(all_remarks.itervalues(), key=lambda r: (r.Hotness, r.__dict__), reverse=True)

... instead of sorting on Hotness alone, this tuple also considers all the other fields of the remark.

I will include this change with my next update. It would stabilize comparisons going forward but it won't make the output comparison from this patch match the baseline.
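As a quick illustration with a hypothetical Remark stand-in (this relies on Python 2 allowing dict comparison):

    # Sketch: remarks with equal Hotness now sort deterministically,
    # falling back to a comparison of all their other fields.
    class Remark(object):                   # hypothetical stand-in
        def __init__(self, Hotness, Pass):
            self.Hotness, self.Pass = Hotness, Pass

    remarks = [Remark(100, 'inline'), Remark(100, 'licm')]
    remarks.sort(key=lambda r: (r.Hotness, r.__dict__), reverse=True)
    # -> 'licm' first: equal Hotness, so the dicts break the tie (Python 2 only)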

Thanks for the update, Brian:

I just sent you an email with the details.

The differences are limited to the index.html. Most of the differences are inconsequential. I'm trying to explain them one by one and a few more unexplained ones remain, so stay tuned.

There were differences in the source views as well. I just investigated, and it looks like the remark uniquing works differently between the serial and parallel versions -- neither of them is correct.

When we're uniquing the remarks using all_remarks, the key does not include the containing function, so if there is no hotness difference between the different functions (inlining contexts), the remarks will get uniqued.

On the other hand, the parallel version only does uniquing within one thread. We do merge all_remarks from each thread but file_remarks at that point could contain duplicates. We should probably unique in merge_dicts?

If you want to see an example of this, diff _Users_adam_proj_org_llvm_build-rel_bin_.._include_c++_v1___functional_base.html between html-orig and html-par from the tarball I sent you. On the very first line we annotate (line 63), there are licm remarks with different inlining contexts on the parallel version but not on the serial one.
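For illustration, a uniquing key that keeps inlining contexts apart might look like this; the class and field names here are assumptions, not the script's actual Remark:

    # Sketch: include the containing function in the remark's uniquing key,
    # so identical remarks in different inlining contexts stay distinct.
    class Remark(object):                   # hypothetical stand-in
        def __init__(self, Pass, Name, File, Line, Column, Function):
            self.Pass, self.Name = Pass, Name
            self.File, self.Line, self.Column = File, Line, Column
            self.Function = Function

        @property
        def key(self):
            return (self.Pass, self.Name, self.File,
                    self.Line, self.Column, self.Function)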

bcain added a comment.Jan 23 2017, 7:41 AM

Thanks for the update, Brian:

I just sent you an email with the details.

The differences are limited to the index.html. Most of the differences are inconsequential. I'm trying to explain them one by one and a few more unexplained ones remain, so stay tuned.

There were differences in the source views as well. I just investigated, and it looks like the remark uniquing works differently between the serial and parallel versions -- neither of them is correct.

Oh, yes. I suppose I won't be able to produce most of the source views correctly for this example unless I have all of the source files (and have them installed at the same paths as the producer).

When we're uniquing the remarks using all_remarks, the key does not include the containing function, so if there is no hotness difference between the different functions (inlining contexts), the remarks will get uniqued.

On the other hand, the parallel version only does uniquing within one thread. We do merge all_remarks from each thread but file_remarks at that point could contain duplicates. We should probably unique in merge_dicts?

file_remarks is indexed on file-line. The difference I can see between the serial and parallel implementations is not the presence/absence of duplicates, but that the order in which they're append()ed differs. That shouldn't matter if file_remarks is only referenced while generating the source-file renders, right?

If the .key property of Remark would be better off including self.Function, should I just add it?

Also: note that this parallel implementation spawns processes, not threads. This means that globally visible changes (e.g. class members) have to be manually delivered back to the parent. This is why Remark.max_hotness is not updated in get_remarks(). Changing this implementation to be multithreaded is simple and only requires slightly different imports.
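A toy illustration of that constraint, with hypothetical names:

    # Sketch: class-attribute updates in a worker process are invisible to
    # the parent; aggregates must be returned and reduced after the map.
    from multiprocessing import Pool

    class Remark(object):
        max_hotness = 0                         # hypothetical class member

    def get_remarks(path):
        hotness = 42                            # stand-in for parsed data
        # Setting Remark.max_hotness here would only change this child's copy.
        return hotness

    if __name__ == '__main__':
        pool = Pool()
        Remark.max_hotness = max(pool.map(get_remarks, ['a.yaml', 'b.yaml']))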

If you want to see an example of this, diff _Users_adam_proj_org_llvm_build-rel_bin_.._include_c++_v1___functional_base.html between html-orig and html-par from the tarball I sent you. On the very first line we annotate (line 63), there are licm remarks with different inlining contexts on the parallel version but not on the serial one.

I'll see if I can make a minimal reproducer; sorting through all of these differences is difficult to reason about, IMO.

Hi Brian!

If the .key property of Remark would be better off including self.Function, should I just add it?

I think we should just go with your patch as-is and incrementally improve the situation from there. As I said, the original version is also incorrect, so we're technically not regressing anything here.

I have been holding off commits to avoid causing merge conflicts for you but the queue is getting pretty long; I'd like to get this in and go from there.

Also: note that this parallel implementation spawns processes, not threads. This means that globally visible changes (e.g. class members) have to be manually delivered back to the parent. This is why Remark.max_hotness is not updated in get_remarks(). Changing this implementation to be multithreaded is simple and only requires slightly different imports.

If you want to see an example of this, diff _Users_adam_proj_org_llvm_build-rel_bin_.._include_c++_v1___functional_base.html between html-orig and html-par from the tarball I sent you. On the very first line we annotate (line 63), there are licm remarks with different inlining contexts on the parallel version but not on the serial one.

I'll see if I can make a minimal reproducer; sorting through all of these differences is difficult to reason about, IMO.

I wouldn't worry about this. Why don't you just post the latest version of the patch, and then we can improve/fix things further in-tree?

Thanks again for your work.

Adam

Hi Brian!

If the .key property of Remark would be better off including self.Function, should I just add it?

I think we should just go with your patch as-is and incrementally improve the situation from there. As I said, the original version is also incorrect, so we're technically not regressing anything here.

Ok, thanks. Can you commit this patch under review, then? I don't believe that I have commit privileges.

-Brian

Ok, thanks. Can you commit this patch under review, then? I don't believe that I have commit privileges.

Sure!

anemet accepted this revision.Jan 24 2017, 11:41 AM
This revision is now accepted and ready to land.Jan 24 2017, 11:41 AM
This revision was automatically updated to reflect the committed changes.

Brian,

I've found a few more things that needed fixing. Please see commits r293262-r293266. Let me know if you have comments.

Also, my speedup for the files under lib/Transforms/Scalar went from 9m42s to 4m17s.

Thanks again,
Adam