This is an archive of the discontinued LLVM Phabricator instance.

[MC] Introduce NeverAlign fragment type
Needs Review · Public

Authored by Amir on Mar 4 2021, 2:29 PM.

Details

Summary

Introduce NeverAlign fragment type.

The intended usage of this fragment is to insert it before a pair of
macro-op fusion eligible instructions. The NeverAlign fragment ensures that
the next fragment (the first instruction in the pair) does not end at a
given alignment boundary, by emitting a minimal-size nop if necessary.

In effect, it ensures that a pair of macro-fusible instructions is not
split by a given alignment boundary, which is a precondition for
macro-op fusion in modern Intel Cores (64B = cache line size, see Intel
Architecture Optimization Reference Manual, 2.3.2.1 Legacy Decode
Pipeline: Macro-Fusion).
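
As a rough illustration of these semantics, here is a small standalone sketch (not the patch's MC code; the offsets, the 3-byte cmp size, and the 1-byte minimum nop are illustrative assumptions) of the padding decision a NeverAlign fragment makes:

#include <cstdint>
#include <iostream>

// Toy model of the NeverAlign decision: pad with one minimal nop only if the
// next fragment (the first instruction of the fusible pair) would otherwise
// end exactly on the alignment boundary, i.e. the jcc would start on a new
// cache line and the pair would not macro-fuse.
uint64_t neverAlignPadding(uint64_t Offset, uint64_t NextFragSize,
                           uint64_t Alignment, uint64_t MinNopSize = 1) {
  return (Offset + NextFragSize) % Alignment == 0 ? MinNopSize : 0;
}

int main() {
  // A 3-byte cmp starting at byte 61 of a 64B line would end exactly on the
  // boundary, splitting cmp+jcc: one nop is inserted before it.
  std::cout << neverAlignPadding(61, 3, 64) << "\n"; // 1
  // A 3-byte cmp starting at byte 10 ends mid-line: no padding is needed.
  std::cout << neverAlignPadding(10, 3, 64) << "\n"; // 0
}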

This patch introduces functionality used by BOLT when emitting code with
MacroFusion alignment already in place.

The use case is different from BoundaryAlign and instruction bundling:

  • BoundaryAlign can be extended to perform the desired alignment for the first instruction in the macro-op fusion pair (D101817). However, this approach has higher overhead due to its reliance on relaxation, which BoundaryAlign requires in the general case; see https://reviews.llvm.org/D97982#2710638.

  • Instruction bundling: the intent of the NeverAlign fragment is to prevent the first instruction in a pair from ending at a given alignment boundary, by inserting at most one minimum-size nop. It's OK if either instruction crosses the cache line. Padding both instructions using bundles so that they do not cross the alignment boundary would result in excessive padding (see the sketch after this list). There's no straightforward way to request instruction bundling to avoid a given end alignment for the first instruction in the bundle.
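
For a rough sense of the padding cost difference, here is a small standalone sketch (not LLVM code; the 64-byte line size and the 9-byte pair size are illustrative assumptions) of the bundle-style worst case:

#include <cstdint>
#include <iostream>

// Bundle-style padding: keep the whole cmp+jcc pair inside one Align-byte
// line, so if the pair would cross the boundary, pad all the way up to it.
uint64_t bundlePadding(uint64_t Offset, uint64_t PairSize, uint64_t Align) {
  uint64_t Used = Offset % Align;
  return (Used + PairSize > Align) ? Align - Used : 0;
}

int main() {
  // A 9-byte cmp+jcc pair starting 8 bytes before a 64B boundary: bundling
  // inserts 8 bytes of nops, and in the worst case needs PairSize - 1 bytes.
  // NeverAlign only forbids the cmp from ending on the boundary, so it never
  // inserts more than one minimal nop for the same pair.
  std::cout << bundlePadding(56, 9, 64) << "\n"; // 8
  std::cout << bundlePadding(32, 9, 64) << "\n"; // 0 (the pair fits in the line)
}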

Diff Detail

Event Timeline

Amir created this revision.Mar 4 2021, 2:29 PM
Amir requested review of this revision.Mar 4 2021, 2:29 PM
Herald added a project: Restricted Project.Mar 4 2021, 2:29 PM
Amir updated this revision to Diff 328332.Mar 4 2021, 4:23 PM
Amir edited reviewers, added: hoy, grosbach; removed: maksfb, rafauler.
Amir added subscribers: maksfb, rafaelauler.

Formatting fixes

Amir updated this revision to Diff 328583.Mar 5 2021, 10:35 AM
hoy added a comment.Mar 7 2021, 12:56 PM

Is it the first Bolt upstreaming patch? Good to see it!

Can you update the summary with the semantics of FT_NeverAlign?

Also, I'm not seeing this new type of MCFragment used anywhere in this patch. Looks like the real usage will come with upcoming changes. I'm wondering if it is possible to move some of the upcoming changes here so that this patch is self-contained and can be tested by a regression test?

hoy added a subscriber: wenlei.Mar 7 2021, 12:57 PM
Amir added a comment.Mar 10 2021, 11:23 AM

@hoy, thank you for your suggestions! Yes, it's one of the first BOLT patches, but not the first :)
We've decided with @maksfb to upstream our alignment-for-macrofusion logic as well; we will include it in a follow-up revision. This would enable us to produce a test case.

Amir edited the summary of this revision. (Show Details)Mar 10 2021, 12:50 PM
Amir added a subscriber: skan.Mar 25 2021, 8:01 PM

We've considered two options to integrate this functionality into LLVM:

  1. Similar to the JCC erratum/BoundaryAlign fragment:

Automatically insert NeverAlign into the section on a cmp instruction, check whether a fuseable instruction follows it, and remove the fragment otherwise.

  2. Similar to the current usage in BOLT: https://github.com/facebookincubator/BOLT/blob/68abc968b706b55585b1b8be315aef5d3bf90b1c/bolt/src/BinaryEmitter.cpp#L445

Check the basic block instructions in advance and only emit the NeverAlign fragment if a fuseable sequence is found.

Pros and cons of each approach:

  1. The pro is that macro-op fusion alignment will be applied to inline assembly as well as LLVM-produced MC; the downside is that I've observed a lot of insertions and removals of the NeverAlign fragment, which may hurt processing time.
  2. The pro is that the check and insertion happen once per basic block; the downside is that we need to integrate the logic for deciding when to insert this fragment higher up in the MCStreamer stack, where we see the basic block.

Thoughts?
+CC @skan

The patch should come with tests.

We probably want to avoid another feature which does complex things in emitInstructionBegin/emitInstructionEnd, so (1) helping just inline asm does not seem a good justification to me. Perhaps (2) should be used.

llvm/include/llvm/MC/MCFragment.h
338

Don’t duplicate function or class name at the beginning of the comment.

https://llvm.org/docs/CodingStandards.html#doxygen-use-in-documentation-comments

llvm/lib/MC/MCAssembler.cpp
394

The logic is wrong if NextFrag->getContents().size() % NAF.getAlignment() == 0.
In that case you want to make the value 0

400

If neither, return 0

407

Store Size % getBackend().getMinimumNopSize() into a variable and check whether it is zero, instead of using a loop which increases the size one byte at a time.

skan added a comment.Mar 26 2021, 12:17 AM

We've considered two options to integrate this functionality into LLVM:

  1. Similar to the JCC erratum/BoundaryAlign fragment:

Automatically insert NeverAlign into the section on a cmp instruction, check whether a fuseable instruction follows it, and remove the fragment otherwise.

  2. Similar to the current usage in BOLT: https://github.com/facebookincubator/BOLT/blob/68abc968b706b55585b1b8be315aef5d3bf90b1c/bolt/src/BinaryEmitter.cpp#L445

Check the basic block instructions in advance and only emit the NeverAlign fragment if a fuseable sequence is found.

Pros and cons of each approach:

  1. The pro is that macro-op fusion alignment will be applied to inline assembly as well as LLVM-produced MC; the downside is that I've observed a lot of insertions and removals of the NeverAlign fragment, which may hurt processing time.
  2. The pro is that the check and insertion happen once per basic block; the downside is that we need to integrate the logic for deciding when to insert this fragment higher up in the MCStreamer stack, where we see the basic block.

Thoughts?
+CC @skan

Some questions.

  1. Is the purpose of this patch to avoid the case where cmp ends and jmp starts exactly at a 64B alignment boundary? If that's the only purpose, I think we can reuse the BoundaryAlign fragment by adding a function like "needPadding" in MCAssembler.cpp.
  2. Could you also upstream the code that inserts/removes the NeverAlign fragment, and some test cases? Usage can help us determine whether the design of the fragment is reasonable.
Amir added a comment.Mar 29 2021, 6:51 PM

@MaskRay, @skan, appreciate your comments and suggestions!
@MaskRay:

The patch should come with tests.

Agree, still trying to find a way to test added functionality outside of BOLT's use case. Thank you for reviewing this patch and BOLT upstreaming repo!

@skan:

Is the purpose of this patch to avoid the case where cmp ends and jmp starts exactly at a 64B alignment boundary? If that's the only purpose, I think we can reuse the BoundaryAlign fragment by adding a function like "needPadding" in MCAssembler.cpp.

Yes, that's the purpose. We've looked at BoundaryAlign and aligned instruction bundles, but failed to come up with a simple solution. How do you envision BoundaryAlign avoiding the given alignment boundary falling between cmp and jmp? NeverAlign inserts a one-byte nop before cmp if it would end at a given boundary.

Could you also upstream the code that inserts/removes the NeverAlign fragment, and some test cases? Usage can help us determine whether the design of the fragment is reasonable.

We seek to upstream LLVM parts required by BOLT before pushing BOLT.
The usage is shown here: https://github.com/facebookincubator/BOLT/blob/68abc968b706b55585b1b8be315aef5d3bf90b1c/bolt/src/BinaryEmitter.cpp#L445
We analyze the basic block for macro-op fusion pairs; if such a pair is found, we insert a NeverAlign fragment before the first instruction (cmp) with a call to MCStreamer::emitNeverAlignCodeAtEnd(). That's the only external use of the interfaces we add, which makes it non-trivial to test. That's why I'm seeking to add a way to trigger the insertion of the NeverAlign fragment (see comment above).
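
A hedged sketch of that client-side pattern is below (not BOLT's actual BinaryEmitter code; emitNeverAlignCodeAtEnd() is the interface added by this patch and its exact signature here is an assumption, and the fusion check is left to the caller as a predicate):

#include "llvm/ADT/ArrayRef.h"
#include "llvm/MC/MCInst.h"
#include "llvm/MC/MCStreamer.h"
#include "llvm/MC/MCSubtargetInfo.h"

// Emit a basic block; before the first instruction of each macro-fusible pair
// (e.g. cmp+jcc), request that it must not end exactly on a 64B boundary.
template <typename FusionPredicate>
void emitBlockWithFusionAlignment(llvm::MCStreamer &Streamer,
                                  llvm::ArrayRef<llvm::MCInst> Insts,
                                  const llvm::MCSubtargetInfo &STI,
                                  FusionPredicate isMacroOpFusionPair) {
  for (size_t I = 0, E = Insts.size(); I != E; ++I) {
    if (I + 1 != E && isMacroOpFusionPair(Insts[I], Insts[I + 1]))
      Streamer.emitNeverAlignCodeAtEnd(/*ByteAlignment=*/64); // assumed signature
    Streamer.emitInstruction(Insts[I], STI);
  }
}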

skan added a comment.Mar 29 2021, 8:34 PM

@skan:

Is the purpose of this patch to avoid the case where cmp ends and jmp starts exactly at a 64B alignment boundary? If that's the only purpose, I think we can reuse the BoundaryAlign fragment by adding a function like "needPadding" in MCAssembler.cpp.

Yes, that's the purpose. We've looked at BoundaryAlign and aligned instruction bundles, but failed to come up with a simple solution. How do you envision BoundaryAlign avoiding the given alignment boundary falling between cmp and jmp? NeverAlign inserts a one-byte nop before cmp if it would end at a given boundary.

Could you also upstream the code that inserts/removes the NeverAlign fragment, and some test cases? Usage can help us determine whether the design of the fragment is reasonable.

We seek to upstream LLVM parts required by BOLT before pushing BOLT.
The usage is shown here: https://github.com/facebookincubator/BOLT/blob/68abc968b706b55585b1b8be315aef5d3bf90b1c/bolt/src/BinaryEmitter.cpp#L445
We analyze the basic block for macro-op fusion pairs; if such a pair is found, we insert a NeverAlign fragment before the first instruction (cmp) with a call to MCStreamer::emitNeverAlignCodeAtEnd(). That's the only external use of the interfaces we add, which makes it non-trivial to test. That's why I'm seeking to add a way to trigger the insertion of the NeverAlign fragment (see comment above).

I think most of the logic should already be there for the "NeverAlign" purpose. My concern about the usage of MCStreamer::emitNeverAlignCodeAtEnd() is that it ignores some corner cases (see the check in X86AsmBackend::emitInstructionBegin), which may cause correctness issues. The BoundaryAlign fragment was originally designed for the JCC erratum, and the logic for whether the macro-fused pair crosses the boundary and for the length of the nop to be inserted lies in the function MCAssembler::relaxBoundaryAlign. As far as I can see, the interface MCStreamer::emitNeverAlignCodeAtEnd() is unnecessary, and a simple customization of relaxBoundaryAlign can achieve your purpose easily (setting the size of the BoundaryAlign fragment to 1 if the given alignment boundary falls between the macro-fused pair). In addition, reusing BoundaryAlign can make the test trivial; MC/X86/align-branch*.s gives some examples.

Amir added a comment.Apr 22 2021, 6:16 PM

@skan: I've experimented with adding customization of BoundaryAlign fragment and conducted some tests to see if it's functionally identical and has similar processing time.
I've implemented two options: automatic alignment (similar to JCC erratum mitigation with -x86-branches-within-32B-boundaries), and client insertion: exposed interfaces to insert BoundaryAlign into OS from the client (identical to current use of NeverAlign fragment in BOLT).
Automatic alignment has noticeably higher overhead than current alignment with NeverAlign, which is expected since there are more fragments inserted (BoundaryAlign fragments are inserted on every cmp/test instruction).
However, even with client insertion (the BoundaryAlign fragment is inserted using the same logic as NeverAlign), BoundaryAlign still has higher overhead. One test binary shows up to a ~40% increase in processing time (1:09.36 elapsed for NeverAlign vs. 1:38.23 for BoundaryAlign).
The overhead comes from the fact that BoundaryAlign relies on relaxation to determine its size and invalidates all following fragments, while NeverAlign uses simpler logic, similar to FT_Align, which doesn't trigger invalidation. At this point I don't think we should pursue this direction further, as processing time is important for BOLT. I can submit a diff with automatic alignment for macro-fusion for review, and it might be of use in a compiler/assembler (maybe as an -O3 optimization?), but it won't be used by BOLT.
Thank you for suggesting to try to reuse BoundaryAlign and pointing out various correctness issues handled by BoundaryAlign.

To continue with this review: in order to produce a test case, I'll expose an assembly directive to insert NeverAlign fragment (it's hard to test otherwise outside of BOLT). I'll address the comments and update soon.

Amir updated this revision to Diff 340739.Apr 27 2021, 12:27 AM
Amir edited the summary of this revision. (Show Details)

Addressed comments by @MaskRay, added X86 assembler directive, added a testcase using the directive

Amir updated this revision to Diff 341009.Apr 27 2021, 3:40 PM

Rebased

Amir marked 4 inline comments as done.Apr 27 2021, 3:41 PM
Amir updated this revision to Diff 341029.Apr 27 2021, 4:52 PM

Lost the original diff in the rebase, fixed that

skan added a reviewer: skan.Apr 28 2021, 12:21 AM
skan added inline comments.Apr 28 2021, 1:52 AM
llvm/include/llvm/MC/MCFragment.h
343–350

If MCNeverAlignFragment only emits nops in your usage scenarios, the members EmitNops, Value and ValueSize can be removed. Otherwise, there may be a lot of untested code.

360–365

Remove these interfaces.

llvm/lib/MC/MCAssembler.cpp
402–404

The summary and the comments in the code need to be updated: you insert a minimum-size nop rather than a single one-byte nop, although they are the same in most cases.

661–679

Lines 611-628 are redundant and untested.

llvm/lib/Target/X86/AsmParser/X86AsmParser.cpp
4743–4750

Could you add test cases for lines 4743-4750?

llvm/test/MC/X86/x86_64-directive-avoid_end_align.s
1 ↗(On Diff #341029)

Add "--no-show-raw-insn" after llvm-objdump since you do not need to check instruction encoding.

5–6 ↗(On Diff #341029)

If you remove lines 5-6 and change ".nops 59" to ".nops 62", the purpose of the test will be clearer. And you need to check that the extra nop is inserted only when cmp ends and jmp starts exactly at a 64B alignment boundary.

13–16 ↗(On Diff #341029)

Lines 13-16 can be removed.

skan added inline comments.Apr 28 2021, 5:00 AM
llvm/lib/MC/MCAssembler.cpp
396–397

Should this be a llvm_unreachable?

Amir updated this revision to Diff 341364.Apr 28 2021, 5:36 PM
Amir edited the summary of this revision. (Show Details)

Addressed comments by @skan, added test case for errors in .avoid_end_align directive

Amir marked 9 inline comments as done.Apr 28 2021, 5:37 PM
skan added inline comments.Apr 28 2021, 7:25 PM
llvm/lib/MC/MCAssembler.cpp
391–392

You can add test case like

.L0:
.nops 57
  int3
.avoid_end_align 64
  cmpl $(.L1-.L0), %eax
  je  .L0
.nops 65
.L1:

to check the situation where the fragment after NeverAlign is a RelaxableFragment.

Amir updated this revision to Diff 341413.Apr 28 2021, 11:48 PM

Added a test case suggested by @skan with NeverAlign followed by RelaxableFragment

Amir marked an inline comment as done.Apr 28 2021, 11:48 PM
Amir added inline comments.
llvm/lib/MC/MCAssembler.cpp
391–392

Thank you!

Amir marked an inline comment as done.Apr 29 2021, 12:32 AM
skan accepted this revision.Apr 29 2021, 12:33 AM

LGTM

This revision is now accepted and ready to land.Apr 29 2021, 12:33 AM
Amir added a comment.Apr 29 2021, 1:00 AM

Shengchen @skan, Fangrui @MaskRay: thank you for detailed comments and suggestions! I don't have commit access, can you please commit this for me?

skan added a comment.Apr 29 2021, 1:30 AM

Shengchen @skan, Fangrui @MaskRay: thank you for detailed comments and suggestions! I don't have commit access, can you please commit this for me?

Better to give other reviewers one or two days to confirm.

Amir added a comment.Apr 30 2021, 8:29 PM

Do I need to add some documentation for the new assembler directive?

llvm/lib/MC/MCAssembler.cpp
399

Better reuse isAgainstBoundary

Amir updated this revision to Diff 342654.May 3 2021, 11:47 PM

Moved up and reused isAgainstBoundary.

hoy added a comment.May 4 2021, 12:29 PM

@Amir Let me know if you'd like me to commit this for you.

MaskRay added inline comments.May 4 2021, 12:33 PM
llvm/lib/Target/X86/AsmParser/X86AsmParser.cpp
4745

No need to mention the directive name. The diagnostic uses a caret to highlight the user input.

4749

ditto

4753

The code is self-explanatory

MaskRay requested changes to this revision.May 4 2021, 12:34 PM
MaskRay added inline comments.May 4 2021, 12:34 PM
llvm/test/MC/X86/x86_64-directive-avoid_end_align.s
1 ↗(On Diff #342654)

Remove x86_64- from the filename.

This revision now requires changes to proceed.May 4 2021, 12:34 PM
MaskRay added inline comments.May 4 2021, 12:39 PM
llvm/include/llvm/MC/MCObjectStreamer.h
142

Why "AtEnd"?

llvm/test/MC/X86/x86_64-directive-avoid_end_align_errors.s
1 ↗(On Diff #342654)

You can use .ifdef ERR to place negative tests into the main test file.

5 ↗(On Diff #342654)

See [[#@LINE+1]] in existing tests for how to attach line/column information properly.

Ignorant thinking out loud, just because I've been thinking about a kinda-similar issue:

  1. Where is this fragment going to be added? The alignment will differ per CPU.
  2. Do fragments only have the start point, or can they specify a range of instructions? If they can, then NeverAlign seems contrived to me. Perhaps this should instead be generalized into something like "ensure that this group of instructions doesn't cross ?-byte alignment"?
llvm/include/llvm/MC/MCStreamer.h
838

emit *where*?

Amir added a comment.May 4 2021, 3:39 PM

@lebedev.ri:

Where is this fragment going to be added? The alignment will differ per CPU.

This fragment is inserted before the first instruction in the macro-fusion pair (the cmp instruction in a cmp+jcc pair). Modern Intel Cores have a macro-fusion restriction that cmp+jcc shouldn't be split by a cache line boundary. It's OK for the cmp instruction to cross a cache line boundary. Not all X86 cores have macro-fusion or this restriction, so the insertion policy and alignment are up to the MC client (BOLT or an assembly programmer).

Do fragments only have the start point, or can they specify a range of instructions? If they can, then NeverAlign seems contrived to me. Perhaps this should instead be generalized into something like "ensure that this group of instructions doesn't cross ?-byte alignment"?

This fragment only looks at the subsequent fragment (instruction) in the stream. It doesn't make sense for us to generalize to a range of instructions. There's the BoundaryAlign fragment type, which can handle alignment boundary crossing for macro-fusion pairs for the Intel JCC erratum mitigation, or the BundleAlign fragment, which can align a group of instructions to a specified alignment (left or right alignment inside a bundle; introduced by PNaCl to ensure control-flow integrity). These fragments are more general than NeverAlign and come with a higher overhead for the client.

llvm/include/llvm/MC/MCObjectStreamer.h
142

"End" is the end of the subsequent fragment that we want _not_ be at a given boundary.
It can be replaced with an (arguably) more descriptive name like emitAvoidEndAlign, similar to the directive name. What do you think?

llvm/include/llvm/MC/MCStreamer.h
838

Thanks for flagging! The wording might be a bit misleading. It'd be clearer to state it as "If the end of the fragment following this NeverAlign fragment ever gets aligned to \p ByteAlignment, this fragment emits a single nop before the following fragment to break this end-alignment."

Amir updated this revision to Diff 342888.May 4 2021, 3:45 PM

Addressed comments by @MaskRay and @lebedev.ri:

  1. Directive parsing error messages
  2. Combined tests into single file
  3. Reworded comment for emitNeverAlignCodeAtEnd
Amir marked 5 inline comments as done.May 4 2021, 3:45 PM

@lebedev.ri:

Where is this fragment going to be added? The alignment will differ per CPU.

This fragment is inserted before the first instruction in the macro-fusion pair (the cmp instruction in a cmp+jcc pair).

That much was already clear :)

Modern Intel Cores have a macro-fusion restriction that cmp+jcc shouldn't be split by a cache line boundary. It's OK for the cmp instruction to cross a cache line boundary. Not all X86 cores have macro-fusion or this restriction,

so the insertion policy and alignment are up to the MC client (BOLT or an assembly programmer).

But this problem exists regardless of BOLT, so shouldn't this also happen without BOLT?

Do fragments only have the start point, or can they specify a range of instructions? If they can, then NeverAlign seems contrived to me. Perhaps this should instead be generalized into something like "ensure that this group of instructions doesn't cross ?-byte alignment"?

This fragment only looks at the subsequent fragment (instruction) in the stream. It doesn't make sense for us to generalize to a range of instructions. There's the BoundaryAlign fragment type, which can handle alignment boundary crossing for macro-fusion pairs for the Intel JCC erratum mitigation, or the BundleAlign fragment, which can align a group of instructions to a specified alignment (left or right alignment inside a bundle; introduced by PNaCl to ensure control-flow integrity). These fragments are more general than NeverAlign and come with a higher overhead for the client.

I understand that, I just think generalizing a bit may result in a somewhat more generally useful solution.

Amir added a comment.May 4 2021, 11:48 PM

@lebedev.ri:

Where is this fragment going to be added? The alignment will differ per CPU.

Sorry, didn't understand this question then. Do you mean handling of X86 subtargets?

But this problem exists regardless of BOLT, so shouldn't this also happen without BOLT?

Yes, the issue exists regardless of BOLT. Clang currently doesn't ensure the alignment restriction of macro-fusion. I'm proposing a mechanism in D101817 to address that inside the Streamer, similar to the JCC erratum mitigation. The mechanism can potentially be enabled by the driver in -O3 mode.
But this diff exposes the fragment type used by BOLT, where we see the entire basic block in advance and control the emission, so we don't have to incur the overhead of automatic alignment insertion.

I understand that, I just think generalizing a bit may result in a somewhat more generally useful solution.

What exactly do you think is worth generalizing? If you mean automatic macro-fusion alignment insertion, which would cover both MC clients (BOLT) and the compiler, it's proposed in D101817. But unfortunately the overhead in the BOLT use case is significantly higher than with NeverAlign fragment insertion.

RKSimon added a subscriber: llvm-commits.
Amir added a comment.May 7 2021, 11:10 PM

@RKSimon, @skan, @MaskRay, @lebedev.ri: I've addressed all comments and suggestions. I suggest that we limit this discussion to NeverAlign fragment, and direct all questions and comments regarding automatic macro-fusion alignment to D101817.
Please let me know if there's anything to change here, otherwise I'd gently ask you to approve.

Amir added a comment.EditedMay 24 2021, 4:29 PM

@skan, @MaskRay: it's been three weeks with no updates. Can we commit this diff?

skan added a comment.May 24 2021, 7:13 PM

@skan, @MaskRay: it's been three weeks with no updates. Can we commit this diff?

LGTM, and I already accepted this patch before. @MaskRay requested changes on this; he may have more concerns.

MaskRay added a comment.EditedMay 25 2021, 10:15 PM

This is performance related. I hope folks like @RKSimon @lebedev.ri can comment.

I've made some code suggestions.

llvm/include/llvm/MC/MCFragment.h
37

Perhaps the name should include X86? Inserting a one-byte nop isn't something any RISC arch can do.

llvm/lib/Target/X86/AsmParser/X86AsmParser.cpp
4733

Are you using the assembly syntax .avoid_end_align in tools? Or is it for testing purposes only?

4737

Delete

4743

if (parseEOL())

4748

expected a positive alignment

4753

delete blank line

Amir updated this revision to Diff 348073.May 26 2021, 1:29 PM
Amir marked 2 inline comments as done.

Addressed comments

Amir marked 4 inline comments as done.May 26 2021, 1:30 PM
Amir added inline comments.
llvm/include/llvm/MC/MCFragment.h
37

The NeverAlign fragment emits a minimum-size nop, so its usage is not limited to X86 and can potentially be extended to architectures with fixed-length instructions. Modern ARM and RISC-V cores also have macro-fusion and may face similar cache-line-crossing restrictions.

llvm/lib/Target/X86/AsmParser/X86AsmParser.cpp
4733

It's only for testing purposes.

MaskRay added inline comments.May 26 2021, 1:31 PM
llvm/lib/Target/X86/AsmParser/X86AsmParser.cpp
4733

Add a comment to make clear this is not intended for general usage.

Amir updated this revision to Diff 348083.May 26 2021, 1:51 PM

Added a comment re: avoid_end_align directive usage

Amir marked an inline comment as done.May 26 2021, 1:52 PM
RKSimon resigned from this revision.May 27 2021, 2:46 AM
Amir updated this revision to Diff 349409.Jun 2 2021, 3:57 PM

Rebase

Amir added a comment.Jun 2 2021, 3:58 PM

@lebedev.ri: can you please weigh in on this diff?

Amir removed reviewers: hoy, grosbach.Jun 3 2021, 2:30 PM
Amir added a comment.Jun 8 2021, 1:32 PM

@MaskRay: we didn't get more comments in the last two weeks. @skan is the author of a related functionality (BoundaryAlign) and he has accepted this diff. Do you think we can proceed with it?

@lebedev.ri @reames Could you weigh in from performance perspective?

And tag @SjoerdMeijer on non-x86 usability.

@lebedev.ri @reames Could you weigh in from performance perspective?

I do agree that this is needed at least for intel (not quite sure about amd).

And tag @SjoerdMeijer on non-x86 usability.

efriedma added inline comments.
llvm/lib/MC/MCAssembler.cpp
389

Recomputing the size of the fragment here is suspicious. For other fragments, computeFragmentSize is independent of the layout of the section. If the size needs to be changed based on the layout, it's mutated by MCAssembler::layoutSectionOnce.

I realize FT_Org doesn't follow this rule, but that's not a great example to follow. The following currently crashes:

foo: .org bar
bar:
efriedma added inline comments.Jun 8 2021, 4:32 PM
llvm/lib/MC/MCAssembler.cpp
389

My biggest concern with the current approach is the potential for an infinite loop. There isn't any protection against the following fragment's offset moving backwards, which could lead to infinitely oscillating fragment sizes.

efriedma added inline comments.Jun 8 2021, 5:14 PM
llvm/lib/MC/MCAssembler.cpp
389

Thought a little more.

Probably the simplest way to ensure that computeFragmentSize() is sane is to establish the following rule: it should never examine fragments after the current fragment in the section. If we logically need to examine any fragment after the current fragment, we need to do that using relaxation, inside MCAssembler::layoutSectionOnce. This means we can compute the "current" layout using a single pass of computeFragmentSize() calls.

Under that rule, FT_Align is fine because it only cares about its own offset. The FT_Org code crashes because it tries to look at fragments after itself. FT_NeverAlign is... marginal. The size of an MCDataFragment is fixed, so that doesn't really count as looking forward. But I suspect MCRelaxableFragment doesn't work right: if we do end up relaxing the fragment, I'm not sure we recompute everything we need to.

Amir added inline comments.Jun 8 2021, 7:10 PM
llvm/lib/MC/MCAssembler.cpp
389

@efriedma: do you have a specific corner case in mind? Wouldn’t it be covered by a test case with MCRelaxableFragment following NA fragment?

efriedma added inline comments.Jun 8 2021, 9:01 PM
llvm/lib/MC/MCAssembler.cpp
389

You'd want a testcase where there's an MCRelaxableFragment that gets relaxed after we call computeFragmentSize(). Something like a forward jump of more than 128 bytes. Then make sure we chose the right size for the MCNeverAlignFragment.

It looks like in your regression test, the fragment after is always an MCDataFragment?

Amir added inline comments.Jun 8 2021, 9:55 PM
llvm/lib/MC/MCAssembler.cpp
389

Let me experiment with the idea. The second test is a NeverAlign followed by MCRelaxableFragment (cmpl insn).
Actually we wanted to avoid computing the size using relaxation (as BoundaryAlign does) because it slows down the layout and appears unnecessary if we can handle the RelaxableFragment following NA.

Amir updated this revision to Diff 351058.EditedJun 9 2021, 9:35 PM

@efriedma:
Addressed the concern that NeverAlign may introduce infinite loops in layout,
checked the case if there's an MCRelaxableFragment that gets relaxed after
we invoke computeFragmentSize().

Added the test case confirming that this case is handled correctly:

  1. NeverAlign is added before cmp+jcc, initially no padding.
  2. cmp+jcc are both relaxable, initially in short form, and are split by an alignment boundary,
  3. Which triggers NeverAlign padding, one byte is added before cmp+jcc.
  4. Which triggers jcc relaxation (pushes out jcc to long form)
  5. Which triggers cmp relaxation (pushes out cmp to long form)
  6. Which makes NeverAlign redundant, as cmp+jcc are no longer split by an alignment boundary.
  7. Which disables NeverAlign padding (0 bytes)
  8. Which in theory could shrink cmp+jcc to short forms causing an infinite loop, but as long as relaxation only increases instruction sizes, it converges at this point, leaving cmp and jcc in relaxed form.

Which disables NeverAlign padding (0 bytes)

When do we actually do this computation? At first glance, MCAssembler::layoutSectionOnce never actually goes back to recompute the size of the NeverAlign padding.

Which in theory could shrink cmp+jcc to short forms causing an infinite loop, but as long as relaxation only increases instruction sizes, it converges at this point, leaving cmp and jcc in relaxed form.

That seems right.

Amir added a comment.Jun 10 2021, 1:18 PM

Which disables NeverAlign padding (0 bytes)

When do we actually do this computation? At first glance, MCAssembler::layoutSectionOnce never actually goes back to recompute the size of the NeverAlign padding.

There's one invocation of computeFragmentSize which would set the final size of NeverAlign fragment in MCAssembler::finishLayout, which is performed after layoutSectionOnce/layoutOnce is done.

Which disables NeverAlign padding (0 bytes)

When do we actually do this computation? At first glance, MCAssembler::layoutSectionOnce never actually goes back to recompute the size of the NeverAlign padding.

There's one invocation of computeFragmentSize which would set the final size of NeverAlign fragment in MCAssembler::finishLayout, which is performed after layoutSectionOnce/layoutOnce is done.

That's not going to work out in general. Removing the NeverAlign padding can increase the distance between two symbols if .palign is involved. That might require relaxation.

Amir added a comment.Jun 10 2021, 1:44 PM

That's not going to work out in general. Removing the NeverAlign padding can increase the distance between two symbols if .palign is involved. That might require relaxation.

Do you mean p2align? Yes, removing the NeverAlign padding can increase the distance between two symbols. Let me check if that causes a problem.

@lebedev.ri @reames Could you weigh in from performance perspective?

JFYI, I no longer have access to the infrastructure used to test previous patches in this area. You could consider asking @skatkov if needed.

Drive by thought, not intended to be blocking.

Reading over the description, I'm left wondering why not treat this as a bundling problem? We have precedent for bundles of instructions which need to not be split across an alignment boundary. Why not simply say that the test/jcc are in a two-instruction bundle which can't cross the boundary? In fact, shouldn't this be able to reuse the existing boundary align mechanism pretty much exactly? It seems like the same basic problem, just for a different use case.

Or maybe I'm misunderstanding the problem. Do you get good performance if one of the instructions starts in the first cache line, but ends in the second? That doesn't match my memory of the performance characteristics, but I don't have that fully loaded any more either. That seems to be the difference between the two approaches right?

If that is the difference, and it's intentional, I'd suggest updating the commit message to be really explicit about that being the desired behavior in the edge case.

Amir updated this revision to Diff 351623.Jun 11 2021, 7:47 PM

Added Experiment B to address @efriedma's comment.

Amir added a comment.Jun 11 2021, 9:04 PM

Removing the NeverAlign padding can increase the distance between two symbols if .palign is involved. That might require relaxation.

I've tried to reproduce this situation, and came up with the following test case: see experiment B in llvm/test/MC/X86/directive-avoid_end_align.s.
It's a bit hard to follow, but please bear with me:
Experiment A:

  1. NeverAlign is added before cmp+jcc, initially no padding.
  2. cmp+jcc are both relaxable, initially in short form, and are split by an alignment boundary,
  3. Which triggers NeverAlign padding, one byte is added before cmp+jcc.
  4. Which triggers jcc relaxation (pushes out jcc to long form)
  5. Which triggers cmp relaxation (pushes out cmp to long form)
  6. Which makes NeverAlign redundant, as cmp+jcc are no longer split by an alignment boundary.
  7. Which disables NeverAlign padding (0 bytes)
  8. Which in theory could shrink cmp+jcc to short forms causing an infinite loop, but as long as relaxation only increases instruction sizes, it converges at this point, leaving cmp and jcc in relaxed form.

Experiment B:

  1. when NeverAlign padding is removed (step 7 in exp A),
  2. causing an increased distance between two symbols (symbols .L53-.L54),
  3. which might require relaxation of cmp instruction at .L52 (imm operand goes from -128 to -129),

this doesn't cause an incorrect result: the cmp instruction at .L52 is of the correct (relaxed) form and has the correct operand (-129, corresponding to the removed NeverAlign padding).
I've made a slideshow to highlight what I think is happening: https://drive.google.com/file/d/1ORpeGj9iK6q4NUi0AOFZW3poAkn-Q-tH/view?usp=sharing

When do we actually do this computation? At first glance, MCAssembler::layoutSectionOnce never actually goes back to recompute the size of the NeverAlign padding.

I'm not sure of the exact mechanism. It might be that layoutSectionOnce/layoutOnce actually recomputes the size of the NeverAlign padding (indirectly). I can try to debug what's going on, but relaxation is a black box to me.

Amir added a comment.Jun 14 2021, 2:07 PM

Drive by thought, not intended to be blocking.

Reading over the description, I'm left wondering why not treat this as a bundling problem? We have precedent for bundles of instructions which need to not be split across an alignment boundary. Why not simply say that the test/jcc are in a two-instruction bundle which can't cross the boundary? In fact, shouldn't this be able to reuse the existing boundary align mechanism pretty much exactly? It seems like the same basic problem, just for a different use case.

Or maybe I'm misunderstanding the problem. Do you get good performance if one of the instructions starts in the first cache line, but ends in the second? That doesn't match my memory of the performance characteristics, but I don't have that fully loaded any more either. That seems to be the difference between the two approaches right?

If that is the difference, and it's intentional, I'd suggest updating the commit message to be really explicit about that being the desired behavior in the edge case.

The use case is different: the intent is to prevent the first instruction in a pair from ending at a given alignment boundary, by inserting at most one byte. It's OK if either instruction crosses the cache line. The performance metric for this alignment is whether macro-op fusion is performed or not. BOLT aggressively removes nop padding to improve icache/iTLB utilization, so padding both instructions using bundles to not cross the cache line would go against this goal. The performance gain from enabled macro-op fusion in this rare case would be negated by the code size increase due to excessive padding. There's no straightforward way to request instruction bundling to avoid a given end alignment for the first instruction in the bundle.

Following the suggestion by @skan, I've experimented with re-purposing BoundaryAlign (D101817) and achieved the desired alignment, but this approach has more overhead due to its reliance on relaxation (which BoundaryAlign requires in the general case); see https://reviews.llvm.org/D97982#2710638.

So neither instruction bundling nor BoundaryAlign satisfies the functional and overhead requirements that NeverAlign addresses. I'll add this information to the commit message.

Amir added a comment.Jun 14 2021, 3:34 PM

Which disables NeverAlign padding (0 bytes)

When do we actually do this computation? At first glance, MCAssembler::layoutSectionOnce never actually goes back to recompute the size of the NeverAlign padding.

So it's the case that layoutSectionOnce/layoutOnce actually recomputes the size of the NeverAlign padding indirectly:

Inside layoutSectionOnce, when it relaxes a fragment (relaxFragment), if the fragment is FT_Relaxable it calls relaxInstruction, which in turn checks fragmentNeedsRelaxation for the passed fragment F, which then goes over F->getFixups, calling evaluateFixup, and so on through Layout.getSymbolOffset(Sym) and Layout.getFragmentOffset, which, inside the ensureValid check, calls layoutFragment, which then computes the fragment size (computeFragmentSize):

(gdb) l
382           return 0;
383         return Size;
384       }
385
386       case MCFragment::FT_NeverAlign: {
>387         const MCNeverAlignFragment &NAF = cast<MCNeverAlignFragment>(F);
388         const MCFragment *NF = F.getNextNode();
389         uint64_t Offset = Layout.getFragmentOffset(&NAF);
390         size_t NextFragSize = 0;
391         if (const auto *NextFrag = dyn_cast<MCRelaxableFragment>(NF)) {
(gdb) bt
#0  llvm::MCAssembler::computeFragmentSize (this=0x10499e0, Layout=..., F=...) at /home/aaupov/local/llvm-project/llvm/lib/MC/MCAssembler.cpp:387
#1  0x0000000000503e54 in llvm::MCAsmLayout::layoutFragment (this=0x7fffffffc4e0, F=0x104e770)
    at /home/aaupov/local/llvm-project/llvm/lib/MC/MCAssembler.cpp:470
#2  0x0000000000553b8f in llvm::MCAsmLayout::ensureValid (this=0x7fffffffc4e0, F=0x104e9a0)
    at /home/aaupov/local/llvm-project/llvm/lib/MC/MCFragment.cpp:91
#3  0x0000000000553bc7 in llvm::MCAsmLayout::getFragmentOffset (this=0x7fffffffc4e0, F=0x104e9a0)
    at /home/aaupov/local/llvm-project/llvm/lib/MC/MCFragment.cpp:97
#4  0x0000000000553ce7 in getLabelOffset (Layout=..., S=..., ReportError=true, Val=@0x7fffffffbc68: 17057240)
    at /home/aaupov/local/llvm-project/llvm/lib/MC/MCFragment.cpp:111
#5  0x0000000000553d79 in getSymbolOffsetImpl (Layout=..., S=..., ReportError=true, Val=@0x7fffffffbc68: 17057240)
    at /home/aaupov/local/llvm-project/llvm/lib/MC/MCFragment.cpp:118
#6  0x0000000000553fa8 in llvm::MCAsmLayout::getSymbolOffset (this=0x7fffffffc4e0, S=...)
    at /home/aaupov/local/llvm-project/llvm/lib/MC/MCFragment.cpp:154
#7  0x00000000005030e2 in llvm::MCAssembler::evaluateFixup (this=0x10499e0, Layout=..., Fixup=..., DF=0x104e470, Target=...,
    Value=@0x7fffffffbdb8: 18446744073709551615, WasForced=@0x7fffffffbdb7: false) at /home/aaupov/local/llvm-project/llvm/lib/MC/MCAssembler.cpp:258
#8  0x00000000005067ff in llvm::MCAssembler::fixupNeedsRelaxation (this=0x10499e0, Fixup=..., DF=0x104e470, Layout=...)
    at /home/aaupov/local/llvm-project/llvm/lib/MC/MCAssembler.cpp:1026
#9  0x0000000000506994 in llvm::MCAssembler::fragmentNeedsRelaxation (this=0x10499e0, F=0x104e470, Layout=...)
    at /home/aaupov/local/llvm-project/llvm/lib/MC/MCAssembler.cpp:1045
#10 0x0000000000506a2d in llvm::MCAssembler::relaxInstruction (this=0x10499e0, Layout=..., F=...)
    at /home/aaupov/local/llvm-project/llvm/lib/MC/MCAssembler.cpp:1055
#11 0x0000000000507664 in llvm::MCAssembler::relaxFragment (this=0x10499e0, Layout=..., F=...)
    at /home/aaupov/local/llvm-project/llvm/lib/MC/MCAssembler.cpp:1241
#12 0x00000000005077f2 in llvm::MCAssembler::layoutSectionOnce (this=0x10499e0, Layout=..., Sec=...)
    at /home/aaupov/local/llvm-project/llvm/lib/MC/MCAssembler.cpp:1270
Amir edited the summary of this revision. (Show Details)Jun 14 2021, 3:44 PM
Amir marked 4 inline comments as done.

Which disables NeverAlign padding (0 bytes)

When do we actually do this computation? At first glance, MCAssembler::layoutSectionOnce never actually goes back to recompute the size of the NeverAlign padding.

So it's the case that layoutSectionOnce/layoutOnce actually recomputes the size of the NeverAlign padding indirectly:

Inside layoutSectionOnce, when it relaxes a fragment (relaxFragment), if the fragment is FT_Relaxable it calls relaxInstruction, which in turn checks fragmentNeedsRelaxation for the passed fragment F, which then goes over F->getFixups, calling evaluateFixup, and so on through Layout.getSymbolOffset(Sym) and Layout.getFragmentOffset, which, inside the ensureValid check, calls layoutFragment, which then computes the fragment size (computeFragmentSize):

That looks really fragile: it's depending on subtle details of the way the LastValidFragment cache works. I'd like to ensure the interaction here is more obviously correct.

Amir added a comment.Jun 15 2021, 5:57 PM

@efriedma:

That looks really fragile: it's depending on subtle details of the way the LastValidFragment cache works. I'd like to ensure the interaction here is more obviously correct.

NeverAlign functionality doesn't depend on the details of valid fragment caching. I've followed the debugger a bit more thoughtfully and I see that NeverAlign's fragment size is being updated on a relaxation of subsequent fragment here in MCAsmLayout::layoutFragment: https://github.com/llvm/llvm-project/blob/main/llvm/lib/MC/MCAssembler.cpp#L412.

As a big picture: we're trying to avoid using relaxation at all costs, because macro-op fusion alignment fragments are inserted before every eligible cmp+jcc fragment, which is pretty much every basic block. It's known to significantly increase layout time for us. It's easier to reason about this padding in terms of relaxation, but if it can be avoided, it should be, for practical reasons.

Probably the simplest way to ensure that computeFragmentSize() is sane is to establish the following rule: it should never examine fragments after the current fragment in the section. If we logically need to examine any fragment after the current fragment, we need to do that using relaxation, inside MCAssembler::layoutSectionOnce. This means we can compute the "current" layout using a single pass of computeFragmentSize() calls.

I checked the implementation here too, and, indeed, when there is a RelaxableFragment that was relaxed, it gets invalidated by marking "last valid fragment" as its predecessor.

This forces the assembler to call "layoutFragment" on that exact relaxable fragment, which in turn will always ask the predecessor to compute its size.

So the bottom line is: if you have a fragment like neveralign that depends on the size of its immediate successor, that's fine, but if it depends on information of any other successor, then, and only then, you need to implement it as a relaxable fragment.

Amir added a comment.Jun 22 2021, 1:53 PM

@efriedma: did you have a chance to look at test cases and comments above?

NeverAlign functionality doesn't depend on the details of valid fragment caching. I've followed the debugger a bit more thoughtfully and I see that NeverAlign's fragment size is being updated on a relaxation of subsequent fragment here in MCAsmLayout::layoutFragment: https://github.com/llvm/llvm-project/blob/main/llvm/lib/MC/MCAssembler.cpp#L412.

Oh, hmm, I see. I guess in that case, it should be reliable; I'm okay with leaving it the way it is. Maybe add a comment to computeFragmentSize explaining that this is happening, though.

Amir updated this revision to Diff 354124.Jun 23 2021, 6:13 PM
Amir edited the summary of this revision. (Show Details)

Added a comment in computeFragmentSize explaining how the NeverAlign fragment size
is updated on relaxation of the successor fragment.

Amir added a comment.Jun 23 2021, 7:41 PM

@efriedma: thanks for a good inquisitive review and bringing up the question of NA size recomputation. Would you care to accept this diff if you think all concerns are addressed?

Amir added a comment.Jul 6 2021, 12:14 PM

There were no new unaddressed comments or issues regarding this diff. Can it be accepted and landed? @skan

skan accepted this revision.Jul 6 2021, 3:11 PM
MaskRay added a comment.EditedJul 6 2021, 3:15 PM

Sorry for being slow on the review, but you may have noticed that people have been reluctant to accept this, likely because they are unsure whether NeverAlign is the correct design here. So I hope you will not land this as is (this is what the "Needs Review" state indicates).

Amir added a comment.Jul 6 2021, 4:04 PM

@MaskRay: thank you for getting back to it. I understand the review process, but without actionable feedback it's hard to move forward. I also understand the fatigue of this back and forth for reviewers, so I appreciate your getting back to this diff.
I agree that this solution may appear complicated and redundant compared with other (existing, simpler) mechanisms. However, the drawbacks of the other solutions were discussed here and are outlined in the summary. Honest question: what other collateral do you think could show it's the right design?

Matt added a subscriber: Matt.Jul 13 2021, 7:10 AM