This is an archive of the discontinued LLVM Phabricator instance.

[X86] Relax existing instructions to reduce the number of nops needed for alignment purposes
ClosedPublic

Authored by reames on Feb 26 2020, 11:21 AM.

Details

Summary

If we have an explicit align directive, we currently default to emitting nops to fill the space. As discussed in the context of the prefix padding work for branch alignment (D72225), we're allowed to play other tricks such as extending the size of previous instructions instead.

This patch will convert near jumps to far jumps if doing so decreases the number of bytes of nops needed for a following align. It does so as a post-pass after relaxation is complete. It intentionally works without moving any labels or doing anything which might require another round of relaxation.

The point of this patch is mainly to mock out the approach. The optimization implemented is real, and possibly useful, but the main point is to demonstrate an approach for implementing such "pad previous instruction" approaches. The key notion in this patch is to treat padding previous instructions as an optional optimization, not as a core part of relaxation. The benefit to this is that we avoid the potential concern about increasing the distance between two labels and thus causing further potentially non-local code grown due to relaxation. The downside is that we may miss some opportunities to avoid nops.

For the moment, this patch only implements a small set of existing relaxations.. Assuming the approach is satisfactory, I plan to extend this to a broader set of instructions where there are obvious "relaxations" which are roughly performance equivalent.

Note that this patch *doesn't* change which instructions are relaxable. We may wish to explore that separately to increase optimization opportunity, but I figured that deserved it's own separate discussion.

There are possible downsides to this optimization (and all "pad previous instruction" variants). The major two are potentially increasing instruction fetch and perturbing uop caching. (i.e. the usual alignment risks) Specifically:

  • If we pad an instruction such that it crosses a fetch window (16 bytes on modern X86), we may cause the decoder to have to trigger a fetch it wouldn't have otherwise. This can effect both decode speed, and icache pressure.
  • Intel's uop caching have particular restrictions on instruction combinations which can fit in a particular way. By moving around instructions, we can both cause misses an change misses into hits. Many of the most painful cases are around branch density, so I don't expect this to be too bad on the whole.

On the whole, I expect to see small swings (i.e. the typical alignment change problem), but nothing major or systematic in either direction.

Diff Detail

Event Timeline

reames created this revision.Feb 26 2020, 11:21 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 26 2020, 11:21 AM
reames updated this revision to Diff 246799.Feb 26 2020, 11:29 AM

Remove some stray debugging code, and tweak style a bit as a result of no-longer needed variables.

reames updated this revision to Diff 246806.Feb 26 2020, 11:53 AM

Fix a minor bug in new test - intel vs att syntax bites again.

reames edited the summary of this revision. (Show Details)Feb 26 2020, 2:34 PM
reames updated this revision to Diff 246848.Feb 26 2020, 3:30 PM
reames edited the summary of this revision. (Show Details)

Add support for boundary align

At first, I thought we'd be able to handle other directives as well (such as .org), but a closer read indicates I had misread the semantics of those.

skan added a subscriber: skan.Feb 26 2020, 6:17 PM
skan added inline comments.Feb 26 2020, 7:31 PM
llvm/include/llvm/MC/MCFragment.h
278 ↗(On Diff #246848)

For data

279 ↗(On Diff #246848)

"Value is repeated ValueSize times" , I am afraid this statement is not true. ValueSize is the size of the integer (in bytes) of Value, and the Value is repeated FragmentSize / AF.getValueSize() times.

reames marked an inline comment as done.Feb 27 2020, 9:44 AM
reames added inline comments.
llvm/include/llvm/MC/MCFragment.h
279 ↗(On Diff #246848)

You're correct. I'm just going to remove this. The comment change isn't really related to the patch.

reames updated this revision to Diff 247010.Feb 27 2020, 9:52 AM

Remove unrelated comment, and a couple of splittable changes.

reames updated this revision to Diff 247044.Feb 27 2020, 11:33 AM

Restructure code triggered by discussion on D75268. If I simply admit that padding instructions is inherently target specific, we have a nice clean API for any form of target specific padding. As a separate patch, I'll add prefix padding to this API for x86. It also appears that Hexagon has a bundle nop padding (currently via finishLayout) which can be refactored into this API as well.

On top of this, I prototyped what prefix padding for relaxable instructions might look like in D75300.

reames marked an inline comment as done.Feb 27 2020, 1:50 PM
reames added inline comments.
llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
682

Note to self: Since this is increasing the size of the instruction, we need to make sure we're not creating a branch boundary crossing case. If we are, we can just skip the optional expansion here.

skan added inline comments.Feb 29 2020, 1:00 AM
llvm/lib/MC/MCAssembler.cpp
1186–1187

Capitalize the first character of the variable it

1198–1200

F.getKind() == MCFragment::FT_Data || F.getKind() == FT_CompactEncodedInst
I think we can handle MCCompactEncodedInstFragment here.

1214–1215

The return type of getFragmentOffset and computeFragmentSize is uint64_t, so I suggest use uint64_t here is better

1266–1267

uint64_t

llvm/test/MC/X86/align-via-relaxation.s
32

around

skan added a reviewer: skan.Feb 29 2020, 1:06 AM
skan added inline comments.Feb 29 2020, 1:22 AM
llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
682

I think we add the code

if needAlignInst(Inst) return false;
auto *PF = cast_or_null<MCBoundaryAlignFragment>(RF.getPrevNode());
if(PF && PF.canEmitNops()) return false;

in the function canBeRelaxedForPadding to make sure we're not creating a branch boundary crossing case

skan added inline comments.Feb 29 2020, 2:29 AM
llvm/lib/MC/MCAssembler.cpp
1249–1251

We can avoid setting size for boundaryalign with D75404

MaskRay added inline comments.Feb 29 2020, 10:32 PM
llvm/lib/MC/MCAssembler.cpp
1186

for (MCSection &Sec : *this) {

1192

for (MCFragment &F : Sec) {

1198–1200

It seems we can use switch (F.getKind()) here.

1237

The comment can be moved before the if. Then the brace can be deleted.

llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
665

*RF.getSubtargetInfo() -> STI (it is a member variable)

668

SmallString<16> should be sufficient. (The SmallString used in MCAssembler::relaxInstruction should be fixed instead)

llvm/test/MC/X86/align-via-relaxation.s
2

The RUN line is unnecessarily indented. It is not common.

-pc-linux-gnu can be deleted to make the line length smaller.

7

.text and .section .text do the same thing. We can actually delete .file, .text and .section .text.

11

jmp is misaligned.

MaskRay added inline comments.Feb 29 2020, 10:41 PM
llvm/lib/MC/MCAssembler.cpp
793–799

I suspect we may have to do while (layoutOnce(Layout)) and optimizeLayout(Layout) in a lockstep.

optimizeLayout may cause some JCC_1/JMP_1 in MCRelaxableFragment to need relaxation.

794

We can dump the layout only when something has changed. This requires a change to optimizeLayout's return type.

1028

Unneeded

reames marked 7 inline comments as done.Mar 2 2020, 3:14 PM
reames added inline comments.
llvm/lib/MC/MCAssembler.cpp
793–799

The whole point of the design is that we don't have to iterate them. If you see a case where we do, please point it out; that's a bug.

See the comments about which starting offsets are changing and why we're not skipping over any relaxable fragment.

794

I see no value in this. I can make the change if you want, but I don't think it's actually useful.

1198–1200

I'd written it with a switch originally. It was much harder to read. I could use a switch for the filtering and then fall into the complicated logic for the align cases if you want?

llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
665

I'd prefer to leave this as is. It follows the idiom of the other uses of relaxInstruction, and I haven't cross checked that the two subtarget infos are identical.

llvm/test/MC/X86/align-via-relaxation.s
7

We need to know it's a text section not a data section.

11

Er, what? Are you possibly thinking of the branch-align feature? That's not enabled in this test file.

32

No. That would change meaning of comment.

reames updated this revision to Diff 247736.Mar 2 2020, 3:18 PM

Address most of the stylistic comments. A further update to address the one functional bug is forthcoming.

reames updated this revision to Diff 247742.Mar 2 2020, 3:40 PM

Fix the functional bug and add tests for it.

reames updated this revision to Diff 247747.Mar 2 2020, 3:54 PM

Remove the relaxFragment loop. Now that boundary align fragments no longer need relaxed to update size, this is now just a nop.

reames updated this revision to Diff 247752.Mar 2 2020, 4:20 PM
reames retitled this revision from Relax existing instructions to reduce the number of nops needed for alignment purposes to [X86] Relax existing instructions to reduce the number of nops needed for alignment purposes.

Give up and just make all the code target specific. I'd originally hoped to apply this for other targets as well, but when looking at Hexagon (the only one with anything analogous), I realized that forcing a common implementation was just adding complexity for little value. Each backend has it's own set of constraints as to which padding is valuable. Instead of introducing new APIs, just follow the precedent that Hexagon already set.

Once this is in and backed a bit, we can revisit to see if any of it can be made target agnostic after all.

craig.topper added inline comments.Mar 2 2020, 4:49 PM
llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
748

it's ->its

771

The end of this line and into the next line doesn't read right. It says "a it's"

772

it's -> its

skan added inline comments.Mar 2 2020, 5:55 PM
llvm/test/MC/X86/align-via-relaxation.s
11

I think his meaning is "jmp" is misaligned in a text editor.

skan added a comment.Mar 3 2020, 1:02 AM

Remove the relaxFragment loop. Now that boundary align fragments no longer need relaxed to update size, this is now just a nop.

We may need add the relaxFragment loop back since the commit for D75404 was reverted.

reames updated this revision to Diff 247951.Mar 3 2020, 10:17 AM

Rebase and address grammar comments.

reames updated this revision to Diff 247952.Mar 3 2020, 10:20 AM

Remove tabs from test file to fix (textual) alignment.

reames marked 9 inline comments as done.Mar 3 2020, 10:21 AM
reames updated this revision to Diff 248004.Mar 3 2020, 1:09 PM

Add command line options to selectively disable. This is to ease performance regression triage post commit.

I think this is ready to land. I'm just waiting on an LGTM.

skan added inline comments.Mar 3 2020, 7:14 PM
llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
654

Is16BitMode

795

Capitalize i here otherwise clang tidy would report a warning

804

Capitalize i and N here otherwise clang tidy would report a warning

reames marked 3 inline comments as done.Mar 3 2020, 7:54 PM
reames added inline comments.
llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
654

This code is copied from existing example. Will change both after commit.

795

"i" and "I" are different variables.

Also, if clang-tidy reports a warning for "i", clang-tidy has a bug.

804

No, clang-tidy is wrong.

skan accepted this revision.Mar 3 2020, 8:52 PM

LGTM , I applied this patch and passed the chek-all tests and found no new fails when running SPEC.

This revision is now accepted and ready to land.Mar 3 2020, 8:52 PM
This revision was automatically updated to reflect the committed changes.
MaskRay added subscribers: condy, vsk.EditedJan 11 2021, 3:26 PM

The patch locates MCRelaxableFragment's within two MCSymbol's and relaxes some MCRelaxableFragment's to reduce the size of a MCAlignFragment. The behavior is hence dependent on additional temporary labels due to -g.

@condy reported an example where clang -O1 -g and clang -O1 have different .text content
https://bugs.llvm.org/show_bug.cgi?id=42138#c13 (a MCRelaxableFragment (jmp) has 5 bytes with -O1 and 2 bytes with -O1 -g)
(There is also a thread https://lists.llvm.org/pipermail/llvm-dev/2021-January/147568.html CC @vsk for thoughts).

.p2align 4, 0x90 is common due to loops. For a larger program, with a lot of temporary labels (-g vs non -g), the assembly output difference may be quite destined.

--- a.s 2021-01-11 13:57:25.055152745 -0800
+++ b.s 2021-01-11 13:57:20.627140370 -0800
@@ -2,2 +2,3 @@
        .file   "czw.cc"
+       .file   1 "/tmp/c" "czw.cc"
        .globl  _ZN1k1lEv                       # -- Begin function _ZN1k1lEv
@@ -7,2 +8,3 @@
 .Lfunc_begin0:
+       .loc    1 26 0                          # czw.cc:26:0
        .cfi_startproc
@@ -11,2 +13,3 @@
 # %bb.0:                                # %entry
+       #DEBUG_VALUE: l:this <- $rdi
        pushq   %r15
@@ -27,2 +30,4 @@
 .Ltmp0:
+.Ltmp3:
+       #DEBUG_VALUE: l:this <- $rbx
        .cfi_escape 0x2e, 0x00
@@ -30,9 +35,17 @@
        movq    %rsp, %rdx
+.Ltmp4:
+       .loc    1 27 15 prologue_end            # czw.cc:27:15
        movl    $.L.str, %esi
        callq   _ZN1eIciEC1EPc9allocator

I am thinking of whether we can use some properties of MCSymbol to filter out some MCSymbol's in LabeledFragments but cannot find one which works (for example, IsUsedInReloc is only used in ObjectWriter, which is after the assembler optimization).

@condy: -mllvm -x86-pad-for-align=false is a workaround.

skan added a comment.Jan 11 2021, 5:52 PM

I am thinking of whether we can use some properties of MCSymbol to filter out some MCSymbol's in LabeledFragments but cannot find one which works (for example, IsUsedInReloc is only used in ObjectWriter, which is after the assembler optimization).

@condy: -mllvm -x86-pad-for-align=false is a workaround.

Option -x86-pad-for-align is for assembler optimization, I think we should set its default value to false and add some comments that it may interact with labels in text section. If someone turn on it explictly, the difference of machine code will be fine since it's a assembler optimization.

This review has been closed for nearly 10 months. Please move discussion of proposed changes to the bug or another relevant location.

Right, there's no fundamental reason why moving a label has to be forbidden -- but it'd be extremely complex if we allowed moving a label which could cause the re-layout of a fragment we thought we'd already finalized the offsets for. This would happen if the label offset required relaxation of some instruction/data referencing it. That, then, might require undoing the padding from an instruction we've already padded out, due to less alignment-padding being required overall.

So, maybe we can allow changing the address of labels that CAN'T cause such an issue, then? Can we identify them straightforwardly? We would need to avoid changing the offset of any label which might require further relaxation of instructions in a text segment (because that's all this pass can modify).

But that doesn't mean we can modify all labels in text which are referred to only by non-text fixups. Consider relaxable uleb128 data section containing offsets of labels in text sections -- naively it seems fine to allow further relaxation of the uleb128 data after running the padding...but unfortunately that's wrong, because if relaxable instructions in text point _back_ to labels in the uleb128 section, they may need to be relaxed when/if the address offsets in the uleb128 data grow. I think the criteria we'd need to use is something like: find all sections referenced by a relaxable fixup in the text section. Then find all sections referenced by a relaxable fixup in one of those sections, recursively. This is the set of sections that cannot point TO a label in text, if we want to be able to move it. That seems...complicated, itself.

Simpler, is it be valid to simply assume that text sections cannot ever refer, even transitively, to debug sections?

This review has been closed for nearly 10 months. Please move discussion of proposed changes to the bug or another relevant location.

Created https://bugs.llvm.org/show_bug.cgi?id=48742