Page MenuHomePhabricator

Align branches within 32-Byte boundary(NOP padding)
ClosedPublic

Authored by skan on Nov 12 2019, 6:50 PM.

Details

Summary

Microcode update for Jump Conditional Code Erratum may cause performance
loss for some workloads:

https://www.intel.com/content/www/us/en/support/articles/000055650.html

Here is the patch to mitigate performance impact by aligning branches
within 32-byte boundary. The impacted instructions are:

a. Conditional jump.
b. Fused conditional jump.
c. Unconditional jump.
d. Indirect jump.
e. Ret.
f. Call.

Add two options for llvm-mc:

  1. -x86-align-branch-boundary=NUM aligns branches within NUM byte boundary.
  2. -x86-align-branch=TYPE[+TYPE...] specifies types of branches to align.

to align branches within a 32-Byte boundary to reduce the potential performance
loss of the microcode update.

A new MCFragment type, MCBoundaryAlignFragment, is added, which may emit
NOP to align the fused/unfused branch.

alignBranchesBegin inserts MCBoundaryAlignFragment before instructions,
alignBranchesEnd marks the end of the branch to be aligned,
relaxBoundaryAlign grows or shrinks sizes of NOP to align the target branch.

Nop padding is disabled when the instruction may be rewritten by the linker,
such as TLS Call.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
skan added a comment.Dec 8 2019, 11:43 PM

We uncovered another functional issue with this patch, or at least, the interaction of this patch and other parts of LLVM. In our support for STATEPOINT, PATCHPOINT, and STACKMAP we use N-byte nop sequences for regions of code which might be patched out. It's important that these regions are exactly N bytes as concurrent patching which doesn't replace an integral number of instructions is ill-defined on X86-64. This patch causes the N-byte nop sequence to sometimes become (N+M) bytes which breaks the patching. I believe that the XRAY support may have a similar issue.

More generally, I'm worried about the legality of arbitrarily prefixing instructions from unknown sources. In the particular example we saw, we had something along the following:

.Ltmp0:
.p2align 3, 0x90
(16 byte nop sequence)
.Ltmp3:
jmp *%rax

In addition to the patching legality issue above, padding the nop sequence does something else interesting in this example. It changes the alignment of Ltmp3. Before, Ltmp3 was always 8 byte aligned, after prefixes are added, it's not. It's not clear to me exactly what the required semantics here are, but we at least had been assuming the alignment of Ltmp3 was guaranteed in this case. (That's actually how we found the patching issue.)

I could not reproduce the phenomenon that N-byte nop becomes (N+M) bytes with your example. So according to my understanding, I slightly modified your case. (If my understand is wrong, I hope you can point it out :-). )

    .text
    nop
.Ltmp0:
    .p2align 3, 0x90
    .rept 16
    nop
    .endr
.Ltmp3:
    movl  %eax, -4(%rsp)
    .rept 2
    nop
    .endr
    jmp .Ltmp0

The instruction jmp .Ltmp0 starts at byte 0x1e and ends at byte 0x20. If we align the jump with preifx, two prefixes will be added to the .rept2 16 nop .endr. After prefixes are added, the 16-byte nop becomes 18-byte nop, then the label '.Ltmp3' is not 8-byte aligned any more.

I doubt whether the assumption that '.Ltmp3' is 8-byte aligned is right, since the alignment is not explicitly required. For example, I think we can not assuming the instruction jmp .Ltmp0 always starts at byte 0x1e. And in fact, as long as the binary generated by the compiler does not have a one-to-one correspondence with the assembly code, at least one implict assumption of an instruction will be broken.

I could not reproduce the phenomenon that N-byte nop becomes (N+M) bytes with your example. So according to my understanding, I slightly modified your case. (If my understand is wrong, I hope you can point it out :-). )

    .text
    nop
.Ltmp0:
    .p2align 3, 0x90
    .rept 16
    nop
    .endr
.Ltmp3:
    movl  %eax, -4(%rsp)

In our case it was

andl $1, %eax

but it does not matter that much.

.rept 2
nop
.endr
jmp .Ltmp0
The instruction `jmp .Ltmp0` starts at byte 0x1e and ends at byte 0x20.

Again, in our particular case start of the sequence was at xxx8, so 8 + 16(our sequence) + 3(andl) + 5(jmp) == 32.

If we align the jump with preifx, two prefixes will be added to the .rept2 16 nop .endr. After prefixes are added, the 16-byte nop becomes 18-byte nop, then the label '.Ltmp3' is not 8-byte aligned any more.

Yes, thats what happened.

I doubt whether the assumption that '.Ltmp3' is 8-byte aligned is right, since the alignment is not explicitly required.

The point is that we have explicit requirement at the start and we have a lowering into 16-byte sequence that we need to be preserved exactly as it is.
Essentially what we need is a "protection" for this sequence from any changes by machinery that generates the binary code.
How can we protect a particular byte sequence from being changed by this branch aligner?

annita.zhang added a comment.EditedDec 9 2019, 5:14 AM

The point is that we have explicit requirement at the start and we have a lowering into 16-byte sequence that we need to be preserved exactly as it is.
Essentially what we need is a "protection" for this sequence from any changes by machinery that generates the binary code.
How can we protect a particular byte sequence from being changed by this branch aligner?

No, in general we can't. The current solution is based on assembler to insert prefix or nop before the cross (or against) boundary branches. It can only ensure the explicit alignment specified by directive, but not any implicit alignment. I don't think any fixup based on assembler can do it. On the other hand, any code sequence after the alignment directive or even just in a function has some kind of implicit alignment. It's hard for assembler to tell which implicit alignment to preserve. The preferred way is to use explicit alignment directive to specify it.

For your scenario, a NOP padding is more controllable. NOP padding will be inserted just before the branch instructions (or macro fusion branch instructions). So if there's no branches (or macro fusion branches) in your code sequence, there will be no NOP inserted.

The point is that we have explicit requirement at the start and we have a lowering into 16-byte sequence that we need to be preserved exactly as it is.
Essentially what we need is a "protection" for this sequence from any changes by machinery that generates the binary code.
How can we protect a particular byte sequence from being changed by this branch aligner?

No, in general we can't. The current solution is based on assembler to insert prefix or nop before the cross (or against) boundary branches. It can only ensure the explicit alignment specified by directive, but not any implicit alignment. I don't think any fixup based on assembler can do it. On the other hand, any code sequence after the alignment directive or even just in a function has some kind of implicit alignment. It's hard for assembler to tell which implicit alignment to preserve. The preferred way is to use explicit alignment directive to specify it.

For your scenario, a NOP padding is more controllable. NOP padding will be inserted just before the branch instructions (or macro fusion branch instructions). So if there's no branches (or macro fusion branches) in your code sequence, there will be no NOP inserted.

What if I insert explicit align(8) right *after* the sequence?

reames added a comment.Dec 9 2019, 5:17 PM

I just posted an alternate review (https://reviews.llvm.org/D71238) which attempts to carve out a minimum reviewable piece of complexity. The hope is that we can review that one quickly (as there are fewer interacting concerns), and then rebase this one (possibly splitting further).

I had previously suggested in review comments that we should reuse the infrastructure from .bundle_align_mode. When I sat down to actually implement that, I discovered that the code for that has a bunch of interacting assumptions about when fragments are constructed and used vs alignment boundaries. I got a version of this working, but the complexity was worrisome. I now suggest that we should take the rough approach sketched here (a separate fragment before the one being aligned), delete the essentially unused bundle mode code, and revisit a unified representation if needed for memory density at a later time. (i.e. my previous suggestion wasn't a good one)

skan added a comment.Dec 9 2019, 5:27 PM

The point is that we have explicit requirement at the start and we have a lowering into 16-byte sequence that we need to be preserved exactly as it is.
Essentially what we need is a "protection" for this sequence from any changes by machinery that generates the binary code.
How can we protect a particular byte sequence from being changed by this branch aligner?

No, in general we can't. The current solution is based on assembler to insert prefix or nop before the cross (or against) boundary branches. It can only ensure the explicit alignment specified by directive, but not any implicit alignment. I don't think any fixup based on assembler can do it. On the other hand, any code sequence after the alignment directive or even just in a function has some kind of implicit alignment. It's hard for assembler to tell which implicit alignment to preserve. The preferred way is to use explicit alignment directive to specify it.

For your scenario, a NOP padding is more controllable. NOP padding will be inserted just before the branch instructions (or macro fusion branch instructions). So if there's no branches (or macro fusion branches) in your code sequence, there will be no NOP inserted.

What if I insert explicit align(8) right *after* the sequence?

If your insert explicit .align 8 after the sequence, and the sequence doesn't has any branch to be aligned, the current solution won't change the sequence.

skan updated this revision to Diff 233008.Dec 9 2019, 11:48 PM

What if I insert explicit align(8) right *after* the sequence?

If your insert explicit .align 8 after the sequence, and the sequence doesn't has any branch to be aligned, the current solution won't change the sequence.

Well, I kinda figure that from the code behavior right now, however is it really by design or just happens to work now?
Seeing that assembler becomes very intelligent now I would rather have a strict guarantee similar to "hardcode" thing that you have here that protects my sequence
rather than relying on a fact that in current settings moving my label by 8 does not appear to be profitable to assembler.

skan added a comment.Dec 11 2019, 10:40 PM

What if I insert explicit align(8) right *after* the sequence?

If your insert explicit .align 8 after the sequence, and the sequence doesn't has any branch to be aligned, the current solution won't change the sequence.

Well, I kinda figure that from the code behavior right now, however is it really by design or just happens to work now?

It is by design, prefix won't be added to instruction if there is a align directive int the path to the target branch.

Seeing that assembler becomes very intelligent now I would rather have a strict guarantee similar to "hardcode" thing that you have here that protects my sequence
rather than relying on a fact that in current settings moving my label by 8 does not appear to be profitable to assembler.

It depends on what your real need is. If you just need your original implicit alignment to work, turning it into explicit alignment is enough. Or if you need a sequence not changed by
the assembler at all, maybe we need a new directive such as ".hard " or ".bundle".

skan updated this revision to Diff 233545.Dec 12 2019, 2:45 AM
  1. Replace divide and mod operation with bitwise operation
  2. Refine help information

I'm seeing lots of updates to fix bugs, but no movement for many days on both my meta comments and (in some ways more importantly) James's meta comments. (And thanks Philip for chiming in too!)

Meanwhile, we really, really need to get this functionality in place. The entire story for minimizing the new microcode performance hit hinges on these patches, and I'm really worried by how little progress we're seeing here.

Sorry for belated response. We're working hard to go through some paper work to get the performance data ready. I think maybe it's better to open a mailing thread in llvm-dev to post those performance data and discuss those suggestions.

The first data was posted in http://lists.llvm.org/pipermail/llvm-dev/2019-December/137413.html.

Thanks,
Annita

More performance data was posted on http://lists.llvm.org/pipermail/llvm-dev/2019-December/137609.html and http://lists.llvm.org/pipermail/llvm-dev/2019-December/137610.html. Let's move on based on the data.

skan updated this revision to Diff 234041.Dec 16 2019, 5:59 AM
  1. rename getSegmentPrefixSize(const MCInst &MI) to countSegmentPrefix(const MCInst &MI).
  2. remove unnecessary data member of X86AsmBackend std::vector<MCMachineDependentFragment *> PendingAlignmentFragments.
  3. Make the message of assert in MCMachineDependentFragment more clear.

More performance data was posted on http://lists.llvm.org/pipermail/llvm-dev/2019-December/137609.html and http://lists.llvm.org/pipermail/llvm-dev/2019-December/137610.html. Let's move on based on the data.

Based on the SPEC data, we observed 2.6% and 1.3% performance effect in INTRATE and FPRATE geomean respectively. Performance effect on individual components were observed up to 5.1%.

The tool SW mitigation can recover the geomean to within 99% of the original performance with prefix padding to jcc+jmp+fused. The maximum performance loss was reduced to within 2.2% of the original one.

The prefix padding can provide better performance as 0.3%~0.5% in geomean than nop padding on system with micro update. In individual cases, we observed up to 1.4% performance improvement in prefix padding. On a system w/o micro update, we observed 0.7% better performance of prefix padding on INTRATE geomean.

In this SPEC test, the prefix padding to jcc+jmp+fused and prefix padding to all branches has almost the same performance. However, we observed the latter prefix padding had a little bit better performance than the previous one in some cases at the cost of code size.

Since the performance delta in prefix padding and nop padding is incremental, starting from nop padding may be easier to implement as a first step, with additional prefix padding options to explore for additional performance optimizations.

And since the current update enables a relatively simple and general solution to mitigate JCC microcode update (MCU) performance effects to both C/C++ and Assembly without any source code change, we recommend it as a starting point. Then, the proposal for compiler to generate directives could enable further enhancements.

In offline discussion, there was an agreement that we needed further coordination to make sure this patch moves forward quickly. For that reason, there will be a call happening today at 4pm Pacific. Interested parties are welcome to attend.

Zoom Meeting ID: 507-497-8898
https://azul.zoom.us/j/5074978898

Results of the meeting will be summarized and posted here.

Haven't looked into too many details yet but made some suggestions anyway...

llvm/include/llvm/MC/MCFragment.h
727

Store AlignBoundarySize as a shift value, then needPadding doesn't even need to call Log2_64().

llvm/lib/MC/MCFragment.cpp
430

const auto *. The type is obvious according to the right hand side.

llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
101

An unknown value is just ignored, e.g. --x86-align-branch=unknown. I think there should be an error, but I haven't looked into the patch detail to confidently suggest how we should surface this error.

182

} else {

Should --x86-branches-within-32B-boundaries overwrite --x86-align-branch-boundary and --x86-align-branch and --x86-align-branch-prefix-size? My feeling is that it just provides a default value if either of the three options is not specified.

If you are going to remove addKind calls here, you can delete this member function.

193

space after if

No curly braces around simple statements.

636

isa<MCMachineDependentFragment>(F)

llvm/lib/Target/X86/X86InstrInfo.td
1020

Unintentional change not part of this patch?

llvm/test/MC/X86/align-branch-32-1a.s
1 ↗(On Diff #234041)

Did an older version include 32-1b.s or 32-1c.s? Now they are missing.

llvm/test/MC/X86/align-branch-64-1b.s
1 ↗(On Diff #234041)

Create test/MC/X86/Inputs/align-branch-64-1.s and reference it from 1[a-d].s via %S/Inputs/align-branch-64-1.s

Noting another issue we found in local testing (with an older version of this patch). This interacts badly with the implicit exception mechanism in LLVM. For that mechanism, we end up generating assembly which looks more or less like this:
Ltmp:

cmp %rsi, (%rdi)
jcc <target>

And a side table which maps TLmp to another label so that a fault at Ltmp can be interpreted as an extremely expensive branch via signal handler.

The problem is that the auto-alignment of the fused branch causes padding to be introduced which separate the label and the faulting exception, breaking the mapping.

Essentially, this comes down to an implicit assumption that the label stays tightly bundled with the following instruction.

This can happen with either nop or prefix padding.

Just wanted to say thanks for the performance data! I know it was hard to get, but it is really, really useful to help folks evaluate these kinds of changes with actual data around the options available.

skan marked an inline comment as done.Dec 16 2019, 5:17 PM
skan added inline comments.
llvm/lib/MC/MCFragment.cpp
430

Shall we keep consistent with the local code style? const MCLEBFragment *LF = cast<MCLEBFragment>(this); was used here.

Here are the minutes from our phone call a few minutes ago.

Attendees: Andy Kaylor, Craig Topper, Annita Zhang, Tom Stellard, Chandler Carruth, Fedor Sergeev, Philip Reames, Yuanake Luo

Status Summary

Performance data has been posted to llvm-dev. We had a side discussion about nop encoding, and it was mentioned these numbers were collected from runs targeting skylake (i.e. not generic x86). This is similar to the result we (Azul) have collected and shared summaries of previously.

GNU patch has landed Friday - this mostly fixes assembler command line.

Discussion on Approach

Three major options debated:

Assembler only - as in the current patch, assembler does all work, only command line flag
Explicit Directive only - as proposed in my alternate patch, compiler decides exactly what instructions get aligned
Region based directives - as proposed in James' last comment on review, directives enable and disable auto-padding in assembler

Use cases identified:

compiler users
    important than assembler is self contained (i.e. don't have to know compiler options for reproduceability)
    inline assembly looks a lot like assembler users
legacy assembler
    important that existing assembly works unmodified
assembler users
    "try it and see" model vs selective enable vs selective disable
    likely need to support all three

Consensus was that the region based directives met use cases the best. In particular, desire to be able to overrule default (for say, inline assembly or a JITs patchability assumptions) and then restore default. Default assembler behavior remains unchanged.

Stawman syntax proposal

.align_branch_boundary disable/default
.align_branch_boundary enable N, instructions (fused, jcc, jmp, etc..)

We need to ensure a consensus on syntax is shared w/gnu. Annita agreed to coordinate this.

Compiler would essentially just wrap generated assembly in directives.

Issue noticed while writing this up: proposed syntax assumes a default has been set, but doesn't give a way to set one. This would seem to break the desired reproducibility property for compiled code. Revision needed.

Push/Pop semantics were suggested at one point, but were thought to be non-idiomatic?

Other Topics

We very quickly discussed nop vs prefix performance. There was a clear consensus in starting with nop only patch and evolving later as needed.

Next Steps

Annita will refresh current patch with two key changes. 1) Drop prefix support and simplify and 2) drop clang driver support for now. Desire is to minimize cycle time before next iteration so that feedback on approach can be given while reviewers are still around.

Philip will prototype directive parsing support. Annita and Yuo (??) to handle coordination on syntax.

Suggested patch split:

(current patch) command line option to set default, nop only version w/cleaned up code as much as possible
assembler directive support (draft by Philip in parallel)
(future) compiler patch to wrap by default

Side note to Annita: For you to remove "hard code", you'll have to have a placeholder for the enable/disable interface. That should probably be split and rebased in my patch.

Here are the minutes from our phone call a few minutes ago.

Attendees: Andy Kaylor, Craig Topper, Annita Zhang, Tom Stellard, Chandler Carruth, Fedor Sergeev, Philip Reames, Yuanake Luo

Thanks for organization the meeting and making the summary.

Stawman syntax proposal

.align_branch_boundary disable/default
.align_branch_boundary enable N, instructions (fused, jcc, jmp, etc..)

...

Push/Pop semantics were suggested at one point, but were thought to be non-idiomatic?

There is a precedant: .pushsection/.popsection (MCStreamer::SectionStack). With .push_align_branch/.pop_align_branch, we probably don't need the 'switch-to-default' action.

I don't know how likely we may ever need nested states (e.g. an .include directive inside an .align_branch region where the included file has own idea about branch alignment), but .push/.pop does not seem to be more complex than disable/enable/default.

I confirm that the following 4 commits have been to the binutils-gdb repository (https://sourceware.org/ml/binutils/2019-12/msg00138.html https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=e379e5f385f874adb0b414f917adb1fc50e20de9).

gas: Add md_generic_table_relax_frag
i386: Align branches within a fixed boundary
i386: Add -mbranches-within-32B-boundaries
i386: Add tests for -malign-branch-boundary and -malign-branch

Here are the minutes from our phone call a few minutes ago.

Thanks for coordinating the meeting and having a clear summary. It helps a lot to accelerate the patch review. I really appreciate it!

Annita will refresh current patch with two key changes. 1) Drop prefix support and simplify and 2) drop clang driver support for now. Desire is to minimize cycle time before next iteration so that feedback on approach can be given while reviewers are still around.

Yes, we are working on it right now. Hopefully we can submit a new patch today or tomorrow.

Philip will prototype directive parsing support. Annita and Yuo (??) to handle coordination on syntax.

I suppose it's Annita and Fangrui

Side note to Annita: For you to remove "hard code", you'll have to have a placeholder for the enable/disable interface. That should probably be split and rebased in my patch.

Let's do it in your directive patch.

skan added a comment.Dec 17 2019, 12:24 AM

Noting another issue we found in local testing (with an older version of this patch). This interacts badly with the implicit exception mechanism in LLVM. For that mechanism, we end up generating assembly which looks more or less like this:
Ltmp:

cmp %rsi, (%rdi)
jcc <target>

And a side table which maps TLmp to another label so that a fault at Ltmp can be interpreted as an extremely expensive branch via signal handler.

The problem is that the auto-alignment of the fused branch causes padding to be introduced which separate the label and the faulting exception, breaking the mapping.

Essentially, this comes down to an implicit assumption that the label stays tightly bundled with the following instruction.

This can happen with either nop or prefix padding.

How about insert NOP before the label Ltmp?

There is a precedant: .pushsection/.popsection (MCStreamer::SectionStack). With .push_align_branch/.pop_align_branch, we probably don't need the 'switch-to-default' action.

I don't know how likely we may ever need nested states (e.g. an .include directive inside an .align_branch region where the included file has own idea about branch alignment), but .push/.pop does not seem to be more complex than disable/enable/default.

I rethink about the directives and prefer the .push/.pop pair as @MaskRay suggested. To be specified, I'd suggest to use .push_align_branch_boundary and .pop_align_branch_boundary to align with MC command line options. They will cowork with the command line options and overwrite the options if both are existing.

To be clarified, I described the behavior of the directives from my understanding. Feel free to speak if you have difference opinion.

.push_align_branch_boundary [N,] [instruction,]*

This directive specifies the beginning of a region which will overwrite the value set by the command line or by the previous directive. It can represent either an enabling or disabling directive controlled by parameter N. 
N indicates to align the branches within N byte boundary. The default value is 32. If N is 0, it means the branch alignment is off within this region. 
Instruction specifies types of branches to align. The value is one or multiple values from fused, jcc, jmp, call, ret and indirect. The default value is fused, jcc and jmp. (may change later)

.pop_align_branch_boundary

This directive specifies the end of a region to align branch boundary. The status will be back to which was set by the previous directive or the one set by the command line if there's no previous directive existing.

I will coordinate with GNU binutils community once we discuss and have agreement with the directives.

.push_align_branch_boundary [N,] [instruction,]*

I'd like to raise again the possibility of using a more general region directive to denote "It is allowable to add prefixes/nops before instructions in this region if the assembler wants to", as I'd started discussing in https://reviews.llvm.org/D71238#1786885 (but let's move the discussion here).

Whether this is OK or not on a particular piece of assembly-code is likely to be a generic property of the code, regardless of the purpose of the optimization. If we're going to have multiple assembler optimizations that can make use of this, it would be nice to express the "OK to pad" "not OK to pad" property only once, rather than once for each kind of optimization which might make such modifications.

In particular, I'd like to look ahead towards the potential implementation of two other features:

  1. Allowing the assembler to prefix-pad instructions in order to avoid having to emit a NOP for p2align directives.
  2. Allowing the assembler to do other instruction-padding performance optimizations to avoid other DSB cacheline limits.

To be concrete, I propose:
".autopad", ".noautopad": allow/disallow the assembler to emit padding via inserting a nop or prefix before any instruction, as needed.
".align_branch_boundary [N,] [instruction,]": Enable branch-boundary padding (per previous description).

In this scheme, I'd generally expect an ".align_branch_boundary" directive to be specified once at the beginning of the file, and ".autopad"/".noautopad" directives to be sprinkled throughout the file as required.

skan updated this revision to Diff 234312.Dec 17 2019, 8:58 AM
skan retitled this revision from Align branches within 32-Byte boundary to Align branches within 32-Byte boundary(NOP padding).
skan edited the summary of this revision. (Show Details)

Simplify

  1. Drop prefix padding support
  2. Drop clang driver support
  3. Drop default align option -x86-branches-within-32B-boundaries
  4. Drop hardcode support

Other

  1. Throw an error if an illegal value is passed to option -x86-align-branch
  2. Use llvm::Align instead of uint64_t to store the information about boundary
  3. Remove test cases for prefix padding and add more test cases for NOP padding
  4. Rename MCMachineDependentFragment to MCBoundaryAlignFragment

There is a precedant: .pushsection/.popsection (MCStreamer::SectionStack). With .push_align_branch/.pop_align_branch, we probably don't need the 'switch-to-default' action.

I don't know how likely we may ever need nested states (e.g. an .include directive inside an .align_branch region where the included file has own idea about branch alignment), but .push/.pop does not seem to be more complex than disable/enable/default.

I rethink about the directives and prefer the .push/.pop pair as @MaskRay suggested. To be specified, I'd suggest to use .push_align_branch_boundary and .pop_align_branch_boundary to align with MC command line options. They will cowork with the command line options and overwrite the options if both are existing.

I agree that we need the push/pop semantics.

To be clarified, I described the behavior of the directives from my understanding. Feel free to speak if you have difference opinion.

.push_align_branch_boundary [N,] [instruction,]*

This directive specifies the beginning of a region which will overwrite the value set by the command line or by the previous directive. It can represent either an enabling or disabling directive controlled by parameter N. 
N indicates to align the branches within N byte boundary. The default value is 32. If N is 0, it means the branch alignment is off within this region. 
Instruction specifies types of branches to align. The value is one or multiple values from fused, jcc, jmp, call, ret and indirect. The default value is fused, jcc and jmp. (may change later)

I'd remove the defaults. Let's just be explicit about what is being enabled/disabled.

.push_align_branch_boundary [N,] [instruction,]*

I'd like to raise again the possibility of using a more general region directive to denote "It is allowable to add prefixes/nops before instructions in this region if the assembler wants to", as I'd started discussing in https://reviews.llvm.org/D71238#1786885 (but let's move the discussion here).

James, I think this proposal is increasing the scope of this proposal too much. It also ignores some of the use cases identified and described in the writeup (i.e. the scoped semantics). I'm open to discussing such a feature more generally, but I'd prefer to see a more narrowly focused feature immediately.

Specifically on the revised patch, I remain confused by the need for multiple subtypes. The need for fragments *between* the potentially fused instructions doesn't make sense to me. What I was expecting to see was the following:
BoundaryAlign w/target=the branch fragment
.. some possibly empty sequence of fragments (i.e. the test/cmp/etc..) ...
the branch fragment
a new data fragment if the branch fragment was a DF

(i.e. a single BounaryAlign fragment which aligns a payload which is defined as "next fragment to target fragment inclusive".)

To be specific, I'd expect to see the following for an example fused sequence:

  1. BoundaryAlign w/Target = 3
  2. DataFragment containing TEST RAX, RAX
  3. RelaxeableFragment containing JNE symbo

Why do we need anything between the two fragments of the fused pair?

(As a reminder, I am new to this code. If I'm missing the obvious, please just point it out.)

.push_align_branch_boundary [N,] [instruction,]*

I'd like to raise again the possibility of using a more general region directive to denote "It is allowable to add prefixes/nops before instructions in this region if the assembler wants to", as I'd started discussing in https://reviews.llvm.org/D71238#1786885 (but let's move the discussion here).

James, I think this proposal is increasing the scope of this proposal too much. It also ignores some of the use cases identified and described in the writeup (i.e. the scoped semantics). I'm open to discussing such a feature more generally, but I'd prefer to see a more narrowly focused feature immediately.

I do not intend that we expand the scope of the project to include any of the other features.

All I want is to slightly consider surrounding features when adding the new assembly syntax. The situations where we want to avoid modifying a certain block of code are extremely likely to apply to any nop-or-prefix-introducing code modifications -- not just modifications resulting from branch alignment. So if we can make the directives annotating where such changes are allowable (and conversely, where they are not) generally-applicable, with a very minimal amount of work, that would be nice.

I also don't understand what you mean by "it ignores [...] scoped semantics"?

I've updated the draft assembler support in https://reviews.llvm.org/D71315 to match the proposal here. Again, this is very much WIP and mostly just to show proposed syntax/usage.

skan added a comment.EditedDec 17 2019, 9:25 PM

Specifically on the revised patch, I remain confused by the need for multiple subtypes. The need for fragments *between* the potentially fused instructions doesn't make sense to me. What I was expecting to see was the following:
BoundaryAlign w/target=the branch fragment
.. some possibly empty sequence of fragments (i.e. the test/cmp/etc..) ...
the branch fragment
a new data fragment if the branch fragment was a DF

(i.e. a single BounaryAlign fragment which aligns a payload which is defined as "next fragment to target fragment inclusive".)

To be specific, I'd expect to see the following for an example fused sequence:

  1. BoundaryAlign w/Target = 3
  2. DataFragment containing TEST RAX, RAX
  3. RelaxeableFragment containing JNE symbo

Why do we need anything between the two fragments of the fused pair?

(As a reminder, I am new to this code. If I'm missing the obvious, please just point it out.)

JUMP is not always emiteed into MCRelaxableFragment, it also can be emitted into MCDataFragment and arithmetic ops with constant arguments of unknown value (e.g. ADD,AND) can be emitted into MCRelaxableFragment , you can find related code in MCObjectStreamer::EmitInstructionImpl, X86AsmBackend::mayNeedRelaxation. Let's say JCC is fused with TEST, there are four possible positions for JCC and CMP

  1. JCC and CMP are in same MCDataFragment
  2. JCC and CMP are in two different MCDataFragment
  3. JCC and CMP are in two different MCRelaxableFragment
  4. JCC in a MCRelaxableFragment, CMP is in a MCDataFragment

and since MCCompactEncodedInstFragment is not applicable yet, i don't what's its behaviour.

In order to compute the total size of CMP and JCC in MCAssembler::relaxBoundaryAlign, I insert a FusedJccSplit to force CMP and JCC in two fragments. Do you have any better idea?

skan updated this revision to Diff 234488.Dec 18 2019, 2:01 AM
skan edited the summary of this revision. (Show Details)

Simplify

Drop the subtype of MCBoundaryAlignFragment and add data member EmitNops to indicate whether NOPs should be emitted.

skan updated this revision to Diff 234515.Dec 18 2019, 5:00 AM
  1. rename MCBoundaryAlignFragment::hasEmitNop() to MCBoundaryAlignFragment::canEmitNop()
  1. reduce the number of MCBoundaryAlignFragment emitted as possible
skan updated this revision to Diff 234520.Dec 18 2019, 5:20 AM

Fix a typo in MCFragment::dump()

skan updated this revision to Diff 234650.Dec 18 2019, 7:38 PM

move the code that checks if we can reuse the current MCBoundaryAlignFragment into the function X86AsmBackend::getOrCreateBoundaryAlignFragment

skan updated this revision to Diff 234697.Dec 19 2019, 4:08 AM

Add more comment

Specifically on the revised patch, I remain confused by the need for multiple subtypes. The need for fragments *between* the potentially fused instructions doesn't make sense to me. What I was expecting to see was the following:
BoundaryAlign w/target=the branch fragment
.. some possibly empty sequence of fragments (i.e. the test/cmp/etc..) ...
the branch fragment
a new data fragment if the branch fragment was a DF

(i.e. a single BounaryAlign fragment which aligns a payload which is defined as "next fragment to target fragment inclusive".)

To be specific, I'd expect to see the following for an example fused sequence:

  1. BoundaryAlign w/Target = 3
  2. DataFragment containing TEST RAX, RAX
  3. RelaxeableFragment containing JNE symbo

Why do we need anything between the two fragments of the fused pair?

(As a reminder, I am new to this code. If I'm missing the obvious, please just point it out.)

JUMP is not always emiteed into MCRelaxableFragment, it also can be emitted into MCDataFragment and arithmetic ops with constant arguments of unknown value (e.g. ADD,AND) can be emitted into MCRelaxableFragment , you can find related code in MCObjectStreamer::EmitInstructionImpl, X86AsmBackend::mayNeedRelaxation. Let's say JCC is fused with TEST, there are four possible positions for JCC and CMP

  1. JCC and CMP are in same MCDataFragment
  2. JCC and CMP are in two different MCDataFragment
  3. JCC and CMP are in two different MCRelaxableFragment
  4. JCC in a MCRelaxableFragment, CMP is in a MCDataFragment

and since MCCompactEncodedInstFragment is not applicable yet, i don't what's its behaviour.

In order to compute the total size of CMP and JCC in MCAssembler::relaxBoundaryAlign, I insert a FusedJccSplit to force CMP and JCC in two fragments. Do you have any better idea?

I agree there are multiple cases here, see the original generic description instead of the specific example. The general question is why a *range* of fragments can't be defined. Computing the instruction size for the entire range then just requires walking from first to last fragment in the range summing the size of each. If both instructions are within the same data fragment, then no relaxation is needed, and the size of both is simply the size of the data fragment.

Unless I'm missing something here?

reames accepted this revision.Dec 19 2019, 2:51 PM

LGTM. The patch isn't perfect, but this has reached the point where landing and iterating in tree is the fastest way to convergence. To be clear, this LGTM *only* applies to the current patch contents, and the *internal only* flag names this introduces. These may change before we expose this publicly.

Warts with the current patch we should iterate to address:

  • Stylistic issues w.r.t. comments, naming, static vs member functions remain. None are show stoppers.
  • Many of the tests can probably be simplified and condensed.
  • The bundling scheme used is probably more complicated than needed (see previous suggestions).
  • This patch doesn't include anyway to locally disable padding, and thus is *known incorrect* in some cases. As this remains off by default, this is not a blocker for commit.

p.s. I spoke w/James this morning, and we came up with some revisions in approach he's going to suggest separately. We did explicitly discuss the status of the current patch, and whether it needed further review cycles or could be iterated in tree. Normally, I'd wait for him to summarize that conversation publicly, but given the time delay and vacation schedules involved, I want to avoid loosing another day cycle.

llvm/include/llvm/MC/MCFragment.h
662

Please add a note that the size is lazily set during relaxation, and is not meaningful before that.

llvm/test/MC/X86/align-branch-32-1a.s
34 ↗(On Diff #234697)

For testing alignment functionality, we can probably use a repl int3 pattern here. It would make the tests much more concise, and shouldn't effect the logic being (currently) tested.

To be concrete, I propose:
".autopad", ".noautopad": allow/disallow the assembler to emit padding via inserting a nop or prefix before any instruction, as needed.
".align_branch_boundary [N,] [instruction,]": Enable branch-boundary padding (per previous description).

I had thought I sent the comments yesterday, but I didn't. :( I think my comments are aligned with the conclusion that Philip and Jame got. Wait for Jame's summary.

In general, I think it's a good idea to have generic directives to control the padding behavior in assembler. ".autopad", ".noautopad", ".align_branch_boundary" looks good except it can't handle the nested scenarios. I can imagine the nested cases in assembly file which includes another assembly file. If we want to handle it, we need to have a pair of directives for each above. It will make the semantics complicated. We need elaborately design it.
I'd echo what Philip said. We want a more focused and basic implementation here. We're very open to have more discussion on generic directives. However, I'd prefer it's a separate topic from this one.

skan added a comment.Dec 19 2019, 6:16 PM

The general question is why a *range* of fragments can't be defined. Computing the instruction size for the entire range then just requires walking from first to last fragment in the range summing the size of each. If both instructions are within the same data fragment, then no relaxation is needed, and the size of both is simply the size of the data fragment.

Unless I'm missing something here?

Thank you for your detailed explanation, the solution has been consistent with your suggestion now.

MaskRay added inline comments.Dec 19 2019, 9:40 PM
llvm/lib/MC/MCAssembler.cpp
1032

be relaxed

1036

uint64_t

1038

auto -> MCFragment

1042

unsigned -> int

1046

uint64_t

skan marked an inline comment as done.Dec 19 2019, 10:59 PM
skan added inline comments.
llvm/lib/MC/MCAssembler.cpp
1042

Why use int here?

skan updated this revision to Diff 234830.Dec 19 2019, 11:42 PM
  1. Simplify the test cases.
  2. Add some comments
skan marked 6 inline comments as done.Dec 19 2019, 11:44 PM
MaskRay added inline comments.Dec 20 2019, 12:04 AM
llvm/lib/MC/MCAssembler.cpp
1042

The loop variable can only be 0 or 1. 0U 1U 2U the unsigned suffix are just redundant. int suffices. It also improves readability a bit by avoiding auto.

skan added a comment.Dec 20 2019, 12:29 AM

Do you think this patch is ready to land? @MaskRay

Do you think this patch is ready to land? @MaskRay

It is 00:35 here and I should confess that I haven't read through this yet.

There are some minor things like (1) pervasive auto (2) report_fatal_error in X86AlignBranchKind is not suitable. I expect the error is reported at a command line parsing stage. (3) I don't see how __tls_get_addr is referenced in code but somehow it magically appears in tests. It may be related to VK_NONE but there should be more tests for !VK_NONE cases. (4) More function-level comments are needed.

If it is not super urgent, probably want a couple of hours to get a response from @reames or @jyknight whether they think this can be committed as is, with iteration work in the future?

(Also note that @fedor.sergeev has requested changes. The convention is to wait for @fedor.sergeev to agree that this can be accepted... but there are vacation schedules involved.)

llvm/lib/MC/MCAssembler.cpp
1048

Align

skan updated this revision to Diff 234846.Dec 20 2019, 2:41 AM
  1. Add more tests for !VK_NONE cases.
  2. Reduce pervasive auto
skan marked an inline comment as done.Dec 20 2019, 2:42 AM
This revision was not accepted when it landed; it landed in state Needs Review.Dec 20 2019, 11:41 AM
This revision was automatically updated to reflect the committed changes.

I've gone ahead and landed the patch so that we can iterate in tree. See commit 14fc20ca62821b5f85582bf76a467d412248c248.

I've also landed a couple of follow up patches to address issues which would have otherwise required iteration on the review. See commits c148e2e2ef86f53391be459752511684424f331b, 4024d49edc1598a6f8017df541147b38bf1e2818, and 8b725f0459eee468ed7f9935fba3278fcb4997b1.

I still see some room for further cleanup (i.e. the fragment range scheme and tests), but what's in is of reasonable quality.

There's a couple follow up patches which are probably called for, but I think we can work on these in parallel now.

  1. We need to settle on assembler syntax.
  2. We need a patch for the x86 MI to MC translation to mark regions unsafe to pad. (Probably best to separate from the above for the moment.)
  3. We can incrementally add support for prefix padding.
MaskRay added a comment.EditedDec 20 2019, 5:00 PM

I've gone ahead and landed the patch so that we can iterate in tree. See commit 14fc20ca62821b5f85582bf76a467d412248c248.

I've also landed a couple of follow up patches to address issues which would have otherwise required iteration on the review. See commits c148e2e2ef86f53391be459752511684424f331b, 4024d49edc1598a6f8017df541147b38bf1e2818, and 8b725f0459eee468ed7f9935fba3278fcb4997b1.

I still see some room for further cleanup (i.e. the fragment range scheme and tests), but what's in is of reasonable quality.

There's a couple follow up patches which are probably called for, but I think we can work on these in parallel now.

  1. We need to settle on assembler syntax.
  2. We need a patch for the x86 MI to MC translation to mark regions unsafe to pad. (Probably best to separate from the above for the moment.)
  3. We can incrementally add support for prefix padding.

If you planned to clean up, you could have made the cleanups in the original land:) Though there would be a problem of proper contribution attribution. After Subversion->Git transition, we can actually run `git commit --amend --author=ask-author-to-provide-name-and- <email>' (Though changing the author may not be fair to your contribution to this commit... oh it is so difficult.) Anyway, it is nice to see this feature before Christmas and people can start investigating its impact now.

I've gone ahead and landed the patch so that we can iterate in tree. See commit 14fc20ca62821b5f85582bf76a467d412248c248.

I've also landed a couple of follow up patches to address issues which would have otherwise required iteration on the review. See commits c148e2e2ef86f53391be459752511684424f331b, 4024d49edc1598a6f8017df541147b38bf1e2818, and 8b725f0459eee468ed7f9935fba3278fcb4997b1.

I still see some room for further cleanup (i.e. the fragment range scheme and tests), but what's in is of reasonable quality.

There's a couple follow up patches which are probably called for, but I think we can work on these in parallel now.

  1. We need to settle on assembler syntax.
  2. We need a patch for the x86 MI to MC translation to mark regions unsafe to pad. (Probably best to separate from the above for the moment.)
  3. We can incrementally add support for prefix padding.

Thanks for landing it! And happy holiday to all!