This is an archive of the discontinued LLVM Phabricator instance.

[X86][MC] Support enhanced relaxation for branch align
ClosedPublic

Authored by skan on Mar 17 2020, 7:54 AM.

Details

Summary

Since D75300 has been landed, I want to support enhanced relaxation when we need to align branches and allow prefix padding. "Enhanced Relaxtion" means we allow an instruction that could not be traditionally relaxed to be emitted into RelaxableFragment so that we increase its length by adding prefixes for optimization.

The motivation is straightforward, RelaxFragment is mostly for relative jumps and we can not increase the length of jumps when we need to align them, so if we need to achieve D75300's purpose (reducing the bytes of nops) when need to align jumps, we have to make more instructions "relaxable".

Diff Detail

Event Timeline

skan created this revision.Mar 17 2020, 7:54 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 17 2020, 7:54 AM
skan updated this revision to Diff 251545.Mar 19 2020, 8:20 PM

Split part of code to D76475, the new added test can pass after D76475 is landed

skan updated this revision to Diff 251548.Mar 19 2020, 8:32 PM

[NFC] align the text in the test file

LuoYuanke added inline comments.Mar 19 2020, 9:56 PM
llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
165

Why not just name it as "allowRelaxing" ? The relax information is eventually be indicated in instruction's .td file, right?

skan updated this revision to Diff 251553.Mar 19 2020, 10:12 PM

Fix the warning reported by clang-format

skan marked an inline comment as done.Mar 19 2020, 10:19 PM
skan added inline comments.
llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
165

See the description in the summary. Traditional relaxation means the operand's size of instruction may be change (e.g, JCC1->JCC4), but if we allow enhanced relaxtion, we can change the instruction's size by adding prefixes on it. They are different, and the enhanced relaxtion has nothing to do with td file

reames requested changes to this revision.EditedMar 24 2020, 4:10 PM

I don't think this is a good idea. Without evidence to the contrary, I would assume that this suffers from exactly the same memory overhead problems as the original rejected patch. The basic challenge is that a MCInst is quite large (~144 bytes if my quick mental math is right). And the containing RelaxableInst is a bit larger still. The storage for a small encoded instruction in a DataFragment without fixups is ~2-6 bytes. That's a huge difference.

An alternate approach is to explore allowing prefixes to be spliced into DataFragments. This would require keeping something analogous to a Fixup with the offset, prefix, and maximum bytes to insert.

Another idea would be a new fragment subclass for instructions which are paddable, but not relaxable. Or possibly a compressed representation for RelaxableFragment (i.e. most of the space/generality appears to be overkill.)

This revision now requires changes to proceed.Mar 24 2020, 4:10 PM
skan added a comment.Mar 27 2020, 2:22 AM

I don't think this is a good idea. Without evidence to the contrary, I would assume that this suffers from exactly the same memory overhead problems as the original rejected patch. The basic challenge is that a MCInst is quite large (~144 bytes if my quick mental math is right). And the containing RelaxableInst is a bit larger still. The storage for a small encoded instruction in a DataFragment without fixups is ~2-6 bytes. That's a huge difference.

I did some measurement for the memory usage, and will paste the data next week.

An alternate approach is to explore allowing prefixes to be spliced into DataFragments. This would require keeping something analogous to a Fixup with the offset, prefix, and maximum bytes to insert.

I don't want that DataFragment can change its size, it's very strange. And if we allow prefixes to be spliced into DataFragments, we have to add a vector member to DataFragment to record the position where each encoded instruction starts. It would increase the size of DataFragment even if we do not need prefix padding.

Another idea would be a new fragment subclass for instructions which are paddable, but not relaxable. Or possibly a compressed representation for RelaxableFragment (i.e. most of the space/generality appears to be overkill.)

Unless we don't need the MCInst member, we almost can not have a smaller fragment than RelaxableFragment to do the padding.
In addition, introducing a new kind of fragment would increase code complexity in MCAssembler.cpp , X86AsmBackend::padInstruction a lot.

skan added a comment.Apr 4 2020, 6:37 AM

Applied D76286,D76475 on commit 7b808b105f6aedc4066502b68b71cf205bafa582

I collected the memory usage for the prefix padding and NOP padding when building SPEC

With LTO, the prefix padding solution consumes 17% more memory than NOP padding in geomean. The highest outlier is 43%. Without LTO, the prefix padding solution consumes 5% more memory than NOP padding in geomean. The highest outlier is 28%.

Options for NOP padding:
-x86-align-branch-boundary=32 -x86-align-branch=fused+jcc+jmp+indirect+call+ret -x86-pad-max-prefix-size=0

Options for prefix padding:
-x86-align-branch-boundary=32 -x86-align-branch=fused+jcc+jmp+indirect+call+ret -x86-pad-max-prefix-size=5

skan edited the summary of this revision. (Show Details)Apr 4 2020, 6:46 AM
skan added a comment.Apr 6 2020, 10:08 PM

Applied D76286,D76475 on commit 7b808b105f6aedc4066502b68b71cf205bafa582

I collected the memory usage for the prefix padding and NOP padding when building SPEC

With LTO, the prefix padding solution consumes 17% more memory than NOP padding in geomean. The highest outlier is 43%. Without LTO, the prefix padding solution consumes 5% more memory than NOP padding in geomean. The highest outlier is 28%.

Options for NOP padding:
-x86-align-branch-boundary=32 -x86-align-branch=fused+jcc+jmp+indirect+call+ret -x86-pad-max-prefix-size=0

Options for prefix padding:
-x86-align-branch-boundary=32 -x86-align-branch=fused+jcc+jmp+indirect+call+ret -x86-pad-max-prefix-size=5

Is the extra memory overhead acceptable to reviewers ? It occurs only when we need to align branches by prefix padding, and provide the user more choices for padding branches.

skan edited the summary of this revision. (Show Details)Apr 7 2020, 6:53 PM

Ping

reames accepted this revision.Apr 7 2020, 8:59 PM

LGTM w/one required change before submit. This desperately needs test cases demonstrating the padding of instructions which aren't relaxable. Please add at least a couple. I'm fine adding some additional ones in a separate commit, but landing this without any would be inappropriate.

On the memory usage, I think the demonstrated results are not great - 45% increase worse case in LTO on *spec* is actually pretty bad -, but it's a reasonable starting point to optimize from. The non-LTO maximum looks a lot more reasonable, and I suspect this is good enough for many real users. We can continue iterating on the representation to reduce memory usage in tree.

This revision is now accepted and ready to land.Apr 7 2020, 8:59 PM
skan updated this revision to Diff 255917.Apr 8 2020, 12:56 AM

Address review comments(demonstrate the instruction being padded may be unrelaxable)

skan added a comment.Apr 8 2020, 1:20 AM

Remote builds failed due to the bug of pre-merge check. It tried to apply a landed patch. What's the right way to work around it? Remove some parent revision or ignore the fail?

skan updated this revision to Diff 255927.Apr 8 2020, 2:07 AM

Nothing changes, just try to trigger a new remote build

Harbormaster completed remote builds in B52309: Diff 255917.
This revision was automatically updated to reflect the committed changes.