This is an archive of the discontinued LLVM Phabricator instance.

Representing bundle locked groups as fragments in MCAssembler
Needs ReviewPublic

Authored by colinl on May 4 2016, 9:52 AM.

Details

Summary

The motivation for this change is because we're interested in using bundle locked groups for Hexagon though want the option to relax previous instructions to satisfy alignment rather than only emitting nops.

The high level changes are creating two new fragment types, MCBundleLockFragment and MCBundleUnlockFragment and handling all the bundle logic in MCObjectStreamer rather than MCElfStreamer/MCAssembler.

computeBundlePadding was converted to MCBundleLockFragment::getSize which is queried when computing fragment size.

A function MCAssembler::writeNopData was extracted to write nop data according to bundling requirements.

I removed special functionality around relax-all handling. I'm not sure how much assemble-time memory it was saving but it was generating degenerate sequences of nops:

foo:
       0:       55      pushl   %ebp
       1:       66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00    nopw    %cs:(%eax,%eax)
      10:       66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00    nopw    %cs:(%eax,%eax)
      1f:       90      nop
      20:       c7 04 24 01 00 00 00    movl    $1, (%esp)

Memory usage will decrease during non bundle locked assembling due to the fact that the bundle lock members were removed from the MCFragment base class.

Diff Detail

Event Timeline

colinl updated this revision to Diff 56163.May 4 2016, 9:52 AM
colinl retitled this revision from to Representing bundle locked groups as fragments in MCAssembler.
colinl updated this object.
colinl set the repository for this revision to rL LLVM.
colinl added a subscriber: llvm-commits.

I love how much more localized the blundle handling code is with this patch.

I have never used the feature myself, I I don't think I am comfortable being the main reviewer. Eli, what do you think?

phosek edited edge metadata.EditedMay 6 2016, 11:20 AM

I'm happy with the removal of the relax-all related logic. This was a workaround to reduce the memory usage in constrained environments like PNaCl (where we only have 768MB of address space available), but at the cost of producing fairly suboptimal instruction sequences in some cases. We've recently switched from llc to Subzero for PNaCl and so this workaround is no longer needed.

The new implementation looks significantly cleaner, introducing the new fragment type for bundle handling definitely seems like the right way.

eliben edited edge metadata.May 6 2016, 11:40 AM

I love how much more localized the blundle handling code is with this patch.

I have never used the feature myself, I I don't think I am comfortable being the main reviewer. Eli, what do you think?

Since I haven't touched this code base for years and Petr has been maintaining it, I fully defer to his judgement

dschuff edited edge metadata.May 6 2016, 11:56 AM

In general I like this direction too. Here are a few preliminary comments. Wrt mc-relax-all, we haven't completely replaced llc in PNaCl yet; I wouldn't necessarily be against removing mc-relax-all anyway though. We might want to consider how this would impact memory usage, and especially if it's an improvement over the previous "normal" case, then that's a no-brainer.

lib/MC/MCAssembler.cpp
356

I guess this code enforces the requirement that nop data not cross bundle boundaries? Can we add back some of the explanatory comments? (here and elsewhere where it makes sense?)

lib/MC/MCFragment.cpp
191–192

Can this assert just go away now?

test/MC/X86/AlignedBundling/nesting.s
36

Is there some particular reason why the align_to_end has been moved to the first bundle_lock directive here? I think we still have the requirement that a nested directive with the align_to_end property makes the bundle be aligned to end even if the first directive doesn't have it.

colinl updated this revision to Diff 56604.May 9 2016, 11:48 AM
colinl edited edge metadata.
colinl removed rL LLVM as the repository for this revision.

I retained the behavior of oring AlignToEnd across nested bundles and added back in comments explaining how bundling works as well as breaking up the expression in the writeNopData function for better self-documentation.

Let me know if there are parts that were missed.

As far as memory usage testing, does anyone have a test they could run for this or have a recommended way of testing this? I assume this could be tested by assembling some sort of large BC file with and without bundling.

Petr, did you have some way of measuring memory use other than assembling a large module and using top or whatever to look at the RSS?

include/llvm/MC/MCFragment.h
555

Is this illustration still true? Before, fragments with bundle padding had the "empty" padding space (the ###) followed by the content of the fragments. Now IIUC, there's just another fragment which is nothing but padding, correct? And now F->Offset just points to the beginning of the fragment, like any other fragment?

Yes, I wrote a little script which can be used to see the memory usage in relation to the input size. I have uploaded a modified version which should work with the upstream Clang/LLVM to Gist. Running with 1000 functions and 1000 calls can take a long time so I recommend stopping the script after you get enough data to see the trend (the script iteratively subdivides the input space as it runs so over time you'll get more and more precise data). To check the memory consumption in the relax-all case, just add -mc-relax-all to the list of argument on the line 128. I also recommend changing the size of the worker group on line 227 depending on the number of cores you have to avoid hammering your machine.

colinl updated this revision to Diff 58351.May 24 2016, 4:29 PM
colinl set the repository for this revision to rL LLVM.

Running the memory tests pointed out poor usage due to having a bundle lock/unlock fragment for each instruction fragment.

I added a BundleLocked bit to MCFragment to represent a fragment being bundle locket to the next fragment. This means only 1 bit per fragment is needed instead of 2 objects.

I added a vector of bundle alignments to MCSection to represent bundle alignment changes throughout the section. Rather than storing the bundle alignment in each fragment we look up the bounding alignment in this vector.

I have a separate machine that runs the go file and gnuplot and I'll report the results for the redesign.

It looks like in order to get the better optimized final code there is a memory increase of around 12% while assembling. This is attributed to not being able to use the data fragment packing optimization for non-relaxable instructions and instead each instruction is in its own fragment.

What does everyone think about this tradeoff?

For NaCl, the bundle alignment never changes in the section (or ever at all, it only differs between architectures). Does your use case require that ability?

(Also can you please make sure the diff is uploaded with all the context? It makes it much easier to review in phab).

colinl updated this revision to Diff 58831.May 27 2016, 1:43 PM

Added more diff context.

With Hexagon there can be a penalty when jumping to an instruction that crosses an alignment boundary so especially for things like the top of a loop we don't want that instruction to cross that boundary.

I'm hoping to use instruction bundling to do this since it's the semantics we want but it also means we want bundling to be off usually and only on for specific instructions.

So, from the perspective of an assembly file, you would put extra .bundle_align_mode directives to enable and disable bundling during assembly? I guess you would still want to do this for large groups of instructions, else it's no better than just putting .align before each jump target, right?

(oh by the way there should definitely be some tests for that behavior; the existing code was written assuming there could only ever be one .bundle_align_mode in each asm file.)

colinl added a comment.Jun 2 2016, 1:24 PM

I'll add a test for multiple bundle alignment directives.

The semantics are slightly different than a plain alignment. Plain align will always pad to an alignment boundary though bundle locking only does this if the instruction crosses the boundary.

{ sub } ; 4-byte
.align 16
{ add } ; 8-byte
{ sub } ; 4-byte

Would assemble as
0: { sub
4: nop
8: nop
12: nop }
16: { add }
24: { sub }

Whereas bundle locking would do.

{ sub } ; 4-byte
.bundle_align_mode 4
{ add; add } ; 8-byte
.bundle_align_mode 0
{ sub } ; 4-byte

Would assemble as
0: { sub }
4: { add }
12: { sub }

sidneym resigned from this revision.Mar 27 2020, 5:43 PM
bcain added a subscriber: bcain.Mar 30 2020, 6:40 PM