Page MenuHomePhabricator

Align non-fused branches within 32-Byte boundary (basic case)
AbandonedPublic

Authored by reames on Dec 9 2019, 5:11 PM.

Details

Summary

This patch is derived from D70157. Changes from that one:

  • Remove prefix padding and simply pad with variable lengths nops. This reduces the complexity sharply. We can and should return to prefix padding in a follow up patch.
  • Remove fused instruction handling. Needs to be reimplemented, but not having it makes the mechanics of the basic padding requirements more straight-forward.
  • Rename MCMachineDependentFragment to MCBoundaryAlignFragment and try to clarify (in comments and asserts) associated invariants and expectations.
    • Remove master flag as user expectations (which are set by our patch and the corresponding gcc patch) would not be met by only aligning a subset of branches.

The intent here is to create something straight forward and easy to review. If others agree with the incremental approach, the original patch can be incrementally reimplemented on top of this one.

Diff Detail

Event Timeline

reames created this revision.Dec 9 2019, 5:11 PM
Herald added a project: Restricted Project. · View Herald TranscriptDec 9 2019, 5:12 PM
skan added a reviewer: skan.Dec 9 2019, 5:41 PM
skan added a comment.Dec 9 2019, 5:46 PM

The patch doesn't handle with the hard code case either, but as far as I can see that change was not mentioned in the description of this patch.

reames added a comment.Dec 9 2019, 6:42 PM

The patch doesn't handle with the hard code case either, but as far as I can see that change was not mentioned in the description of this patch.

Unless I'm misreading the original patch, the notion of "hard code" is only relevant to the prefix padding scheme. Admittedly, I might be misreading, it isn't clearly stated anywhere in the original patch exactly what "hard code" is or what purpose it serves. .

skan added a comment.Dec 9 2019, 7:08 PM

The patch doesn't handle with the hard code case either, but as far as I can see that change was not mentioned in the description of this patch.

Unless I'm misreading the original patch, the notion of "hard code" is only relevant to the prefix padding scheme. Admittedly, I might be misreading, it isn't clearly stated anywhere in the original patch exactly what "hard code" is or what purpose it serves. .

In function X86AsmBackend::alignBranchesBegin, I added the comment

// The prefix or nop isn't inserted if the previous item is hard code, which
// may be used to hardcode an instruction, since there is no clear instruction
// boundary.

I'm sorry if I didn't make it clear. As far as i am concerned, I think the hard code case should be dealt with.

reames added a comment.Dec 9 2019, 7:26 PM

The patch doesn't handle with the hard code case either, but as far as I can see that change was not mentioned in the description of this patch.

Unless I'm misreading the original patch, the notion of "hard code" is only relevant to the prefix padding scheme. Admittedly, I might be misreading, it isn't clearly stated anywhere in the original patch exactly what "hard code" is or what purpose it serves. .

In function X86AsmBackend::alignBranchesBegin, I added the comment

// The prefix or nop isn't inserted if the previous item is hard code, which
// may be used to hardcode an instruction, since there is no clear instruction
// boundary.

I'm sorry if I didn't make it clear. As far as i am concerned, I think the hard code case should be dealt with.

I'm still not following. Can you explain the use case here? What test case differs based on the presence of the "hard code" support in the original patch?

skan added a comment.Dec 9 2019, 7:54 PM

The patch doesn't handle with the hard code case either, but as far as I can see that change was not mentioned in the description of this patch.

Unless I'm misreading the original patch, the notion of "hard code" is only relevant to the prefix padding scheme. Admittedly, I might be misreading, it isn't clearly stated anywhere in the original patch exactly what "hard code" is or what purpose it serves. .

In function X86AsmBackend::alignBranchesBegin, I added the comment

// The prefix or nop isn't inserted if the previous item is hard code, which
// may be used to hardcode an instruction, since there is no clear instruction
// boundary.

I'm sorry if I didn't make it clear. As far as i am concerned, I think the hard code case should be dealt with.

I'm still not following. Can you explain the use case here? What test case differs based on the presence of the "hard code" support in the original patch?

    .text
    nop
.Ltmp0:
    .p2align 3, 0x90
    .rept 16
    nop
    .endr
.Ltmp3:
    movl  %eax, -4(%rsp)
    .rept 2
    .byte 0x2e
    .endr
    jmp .Ltmp0

The prefix .rept 2 .byte 0x2e .endr is used to prefix instruction jmp .Ltmp0. We shouldn't add a nop before the this jump, since it makes .rept 2 .byte 0x2e .endr prefix instruction nop.

In function X86AsmBackend::alignBranchesBegin, I added the comment

// The prefix or nop isn't inserted if the previous item is hard code, which
// may be used to hardcode an instruction, since there is no clear instruction
// boundary.

I'm sorry if I didn't make it clear. As far as i am concerned, I think the hard code case should be dealt with.

I'm still not following. Can you explain the use case here? What test case differs based on the presence of the "hard code" support in the original patch?

Consider hand-written assembly code which looks like this:

	 .byte	0x67
	 mov	%rdx, %rax

That is from boringssl, fwiw, https://github.com/google/boringssl/blob/master/crypto/fipsmodule/bn/asm/rsaz-avx2.pl.

Or, an example from Mesa xform4.S:

p4_general_done:
	.byte 0xf3
	ret

Inserting anything between the tacked-on-prefix with .byte, and the rest of the instruction written mnemonically, would be bad.

That said...I don't like the "hard-code" code. It feels like a workaround for the root issue -- that this feature represents a big shift in what x86 assemblers might do to your code, and thus should be opt-in by asm writers, and not enabled by any global flag.

skan added inline comments.Dec 9 2019, 8:30 PM
llvm/lib/MC/MCAssembler.cpp
1007–1008

It seems that you didn't simply the code based on my latest patch, I have removed the class/function name in my comment.

skan added inline comments.Dec 9 2019, 8:50 PM
llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
394–395

There is a deficiency here. For example, there may be a MCAlignFragment fragment between the MCBoundaryAlignFragment and the branch to be aligned, which will result an infinite loop.

I have a general comment. I think the macro-fused jcc instructions should be taken into account even for NOP padding. From our experience, it's better to take the macro-fused instructions as an integrated instruction, and ensure it not cross or against 32-byte boundary by padding if necessary.

Given the lack of macro-fused jcc and hard copy handling in this patch, I would suggest that we base on the previous patch and split it into two, one for NOP padding and the other for prefix padding. I suppose the first one is what you desire to have. And it should cover more corner cases we met and resolved before. What do you think?

@skan

I just want to mention here, that my comment on the original patch still stands and really needs to be addressed:

We need to have detailed performance and binary size numbers for this kind of change. We should be looking at the test suite and SPEC for performance. I'd like to see before/after performance numbers on a system that is *not* using the microcode feature that addresses jCC (either a newer CPU, an older CPu, or an unpatched CPU). I'd also like to see before/after performance numbers on a system that *is* using the microcode feature to address the jCC issue. I'd suggest Clang itself, Firefox, and Chromium as binaries to show the binary size impact of this patch. Happy for others to suggest binaries to use to evaluate the binary size impact.

I continue to think that any patch we plan to land should have these numbers, and each incremental update to that patch should have the updated numbers. Otherwise, there is no way to do a good job reviewing the *impact* of these changes even after folks are satisfied with the *code* in these changes.

In function X86AsmBackend::alignBranchesBegin, I added the comment

// The prefix or nop isn't inserted if the previous item is hard code, which
// may be used to hardcode an instruction, since there is no clear instruction
// boundary.

I'm sorry if I didn't make it clear. As far as i am concerned, I think the hard code case should be dealt with.

I'm still not following. Can you explain the use case here? What test case differs based on the presence of the "hard code" support in the original patch?

Consider hand-written assembly code which looks like this:
...
Inserting anything between the tacked-on-prefix with .byte, and the rest of the instruction written mnemonically, would be bad.

Ok, this makes more sense now.

That said...I don't like the "hard-code" code. It feels like a workaround for the root issue -- that this feature represents a big shift in what x86 assemblers might do to your code, and thus should be opt-in by asm writers, and not enabled by any global flag.

Honestly, I both strongly agree with this, and see the argument as a somewhat lost cause. I think in practical terms, we *will* have to support something which works for both compiler generated assembly and legacy hand written assembly. I dislike that fact and the design it encourages, but it's reality.

On the hard code notion, after thinking about it over night, I suggest that we do *not* include it in the first patch.

First, exactly what "hard code" is has not been settled and is worthy of careful discussion. For instance, I see no particular reason why the "manual prefix" trick described here should be supported, but the assumptions about label alignment mentioned in the previous thread shouldn't be. There is room for discussion here, and if we block all progress on settling this, the review will stall.

Second, there are use cases for which the "hard code" support is unneeded. For instance, compiler generated assembly which doesn't do the silly "manual prefix" trick. Supporting them while we work through the support needed for opt in legacy assembly seems worthwhile.

I just want to mention here, that my comment on the original patch still stands and really needs to be addressed:

We need to have detailed performance and binary size numbers for this kind of change. We should be looking at the test suite and SPEC for performance. I'd like to see before/after performance numbers on a system that is *not* using the microcode feature that addresses jCC (either a newer CPU, an older CPu, or an unpatched CPU). I'd also like to see before/after performance numbers on a system that *is* using the microcode feature to address the jCC issue. I'd suggest Clang itself, Firefox, and Chromium as binaries to show the binary size impact of this patch. Happy for others to suggest binaries to use to evaluate the binary size impact.

I strongly agree that Intel should share these numbers in detail. So far, we've seen some summary shared on llvm-dev, but nothing in detail.

I can't currently share the details of our internal runs, but let me summarize takeaways:

  • All results are very preliminary - this is all from a single machine with a subset of our workloads.
  • Applying microcode without mitigation hurts. Geomean impacts aren't too terrible (5-10%), but some individual workloads suffer badly.
  • Apply mitigation (either nop padding or prefix padding) addresses most of the regressions seen. Not all; further work is needed to understand the outliers, but it's a definite step forward.
  • The difference between nop and prefix padding is in the noise for the current measurements. (Noise is fairly high though, so take that with a grain of salt.)
  • Applying mitigations on an unpatched machine does have a small negative impact. For us, it appears to be around a -1% geomean impact. (Again, *very* noisy results so take with a grain of salt.)
  • Reminder: This is all in the context of a JIT targetting the specific machine architecture run on. That may be important w.r.t. nop padding cost for instance.

I continue to think that any patch we plan to land should have these numbers, and each incremental update to that patch should have the updated numbers. Otherwise, there is no way to do a good job reviewing the *impact* of these changes even after folks are satisfied with the *code* in these changes.

I understand your concern, but I want to caution drawing too firm a line here. We have a situation where review on a very important patch is stalled, and we need to unblock it. This subseting was my attempt to do so, but I'm not in a position to be able to share data at the scale your arguing for. If we can't work incrementally here, I suspect this effort is going to stall.

For context, we will be shipping a mitigation for this. I would strongly prefer that be based on upstream work, but if I have to, we'll diverge and drop from the upstream discussion.

Also, to be careful, I think there's a huge difference between landing infrastructure (this patch) and enabling by default. This patch does not do the later, and I'm not sure I support *ever* doing so. That's for later discussion.

Honestly, I both strongly agree with this, and see the argument as a somewhat lost cause. I think in practical terms, we *will* have to support something which works for both compiler generated assembly and legacy hand written assembly. I dislike that fact and the design it encourages, but it's reality.

Hand-written assembly is generally responsible for its own performance optimizations, and we do not usually expect the assembler to muck around with the code. So, why is it assumed that _this_ optimization should be done automatically on such assembly code?

emaste added a subscriber: emaste.Dec 10 2019, 1:35 PM

Honestly, I both strongly agree with this, and see the argument as a somewhat lost cause. I think in practical terms, we *will* have to support something which works for both compiler generated assembly and legacy hand written assembly. I dislike that fact and the design it encourages, but it's reality.

Hand-written assembly is generally responsible for its own performance optimizations, and we do not usually expect the assembler to muck around with the code. So, why is it assumed that _this_ optimization should be done automatically on such assembly code?

The more I think about this, the more I think James is right here.

Regardless of whether he is or not, we're clearly going to need an opt in assembler syntax for this at some point. Given that, I'm wondering if my time wouldn't be better spent there. The majority of the code would be shared with this patch, and it defers the controversy around optimizing assemblers until later. If we get that piece done, we can build either compiler or optimizing assembler (or both) support on top.

I've gone as far as writing a rough draft of textual assembler support. If folks agree this is helpful, I'll abandon this patch, and post one in that direction. Any high level suggestions as to syntax? I see two major options, but definitely welcome suggestions w.r.t. naming/semantics/etc...

Option 1

.boundary_align 4
jmp foo

and

.boundary_align 4
.bundle_lock
test ...
jcc foo
.bundle_lock

Option 2

Option 1

.boundary_align 4
jmp foo
.end_boundary_align

and

.boundary_align 4
test ...
jcc foo
.end_boundary_align

p.s. If anyone has a better name than "boundary align" please suggest it. That's not a good name; it's just a placeholder.

I've gone as far as writing a rough draft of textual assembler support.

I posted the WIP draft here (https://reviews.llvm.org/D71315). Please keep the discussion here or the original thread. I don't want to fork discussion further. Once we've settled direction, I can cleanup and repost the patch if needed.

I've gone as far as writing a rough draft of textual assembler support. If folks agree this is helpful, I'll abandon this patch, and post one in that direction. Any high level suggestions as to syntax? I see two major options, but definitely welcome suggestions w.r.t. naming/semantics/etc...

Option 1

.boundary_align 4
jmp foo

and

.boundary_align 4
.bundle_lock
test ...
jcc foo
.bundle_lock

Option 2

Option 1

.boundary_align 4
jmp foo
.end_boundary_align

and

.boundary_align 4
test ...
jcc foo
.end_boundary_align

p.s. If anyone has a better name than "boundary align" please suggest it. That's not a good name; it's just a placeholder.

Do we need some option or syntax to disable ".boundary_align 4", so that the same code can be compiled for some machine that do not need align branch?

Honestly, I both strongly agree with this, and see the argument as a somewhat lost cause. I think in practical terms, we *will* have to support something which works for both compiler generated assembly and legacy hand written assembly. I dislike that fact and the design it encourages, but it's reality.

Hand-written assembly is generally responsible for its own performance optimizations, and we do not usually expect the assembler to muck around with the code. So, why is it assumed that _this_ optimization should be done automatically on such assembly code?

I think we need to think about those legacy assembly codes, and libraries written in assembly etc. If they are impacted by the microcode update, the assembler can provide a quick way to mitigate it w/o any code change. Why don't we do it? At least we can give users the option to use it or not.

Do we need some option or syntax to disable ".boundary_align 4", so that the same code can be compiled for some machine that do not need align branch?

I wouldn't think so. Macros could be used for this purpose if needed. We - to my admittedly limited knowledge - don't have anything analogous for other forms of alignment for instance.

bcain added a subscriber: bcain.Dec 12 2019, 7:15 AM

I'm late jumping into the discussion of these patches and may not have seen all comments, so go easy on me if I say something that has already been covered.

I just want to mention here, that my comment on the original patch still stands and really needs to be addressed:

We need to have detailed performance and binary size numbers for this kind of change. We should be looking at the test suite and SPEC for performance. I'd like to see before/after performance numbers on a system that is *not* using the microcode feature that addresses jCC (either a newer CPU, an older CPu, or an unpatched CPU). I'd also like to see before/after performance numbers on a system that *is* using the microcode feature to address the jCC issue. I'd suggest Clang itself, Firefox, and Chromium as binaries to show the binary size impact of this patch. Happy for others to suggest binaries to use to evaluate the binary size impact.

In D71238#1777747, @reames wrote:
I strongly agree that Intel should share these numbers in detail. So far, we've seen some summary shared on llvm-dev, but nothing in detail.

We ran into a very slow required internal process here. Annita tells me that we'll have data to share very soon, hopefully on Monday. I haven't seen the data myself, but I have a suspicion that it isn't going to be what you're looking for. I'm talking to people to see what I can do to establish a faster process for turning around data for incremental updates to these patches. More detailed numbers mean more scrutiny before we're allowed to share. I like the suggestion of using clang. A self build would give us code size and an execution time that I hope would be good enough for evaluating relative differences introduced by the patch. All interested reviewers should easily be able to repeat this test locally on whatever machine they like, and everyone could share their own data from whatever variations they were interested in trying.

In D71238#1776751, @chandlerc wrote:

I continue to think that any patch we plan to land should have these numbers, and each incremental update to that patch should have the updated numbers. Otherwise, there is no way to do a good job reviewing the *impact* of these changes even after folks are satisfied with the *code* in these changes.

In D71238#1777747, @reames wrote:
I understand your concern, but I want to caution drawing too firm a line here. We have a situation where review on a very important patch is stalled, and we need to unblock it. This subseting was my attempt to do so, but I'm not in a position to be able to share data at the scale your arguing for. If we can't work incrementally here, I suspect this effort is going to stall.

I agree with Philip here. Obviously, having data is critical to making decisions and knowing when we've reached an acceptable state. We just need to agree on where the threshold is for enough data to make a reasonable decision.

I just want to mention here, that my comment on the original patch still stands and really needs to be addressed:

We need to have detailed performance and binary size numbers for this kind of change. We should be looking at the test suite and SPEC for performance. I'd like to see before/after performance numbers on a system that is *not* using the microcode feature that addresses jCC (either a newer CPU, an older CPu, or an unpatched CPU). I'd also like to see before/after performance numbers on a system that *is* using the microcode feature to address the jCC issue. I'd suggest Clang itself, Firefox, and Chromium as binaries to show the binary size impact of this patch. Happy for others to suggest binaries to use to evaluate the binary size impact.

I continue to think that any patch we plan to land should have these numbers, and each incremental update to that patch should have the updated numbers. Otherwise, there is no way to do a good job reviewing the *impact* of these changes even after folks are satisfied with the *code* in these changes.

More performance data was posted on http://lists.llvm.org/pipermail/llvm-dev/2019-December/137609.html and http://lists.llvm.org/pipermail/llvm-dev/2019-December/137610.html. Let's move on based on the performance data.

I do like the idea of generalizing bundle_lock to mean generally "keep these instructions together and don't introduce anything extraneous in the middle". So, ".boundary_align <argument>" would apply to the next instruction-or-instruction-bundle, and will emit nops at its location, such that the next instruction is guaranteed to have the proper "within but not ending at a given 2**<arg> block" alignment.

This has the advantage that padding is only added to locations that explicitly ask for it -- unmodified assembly code will not be broken. (BTW, not breaking existing assembly code by introducing unexpected padding seems pretty important IMO, which is why I keep coming back to it. If we don't figure that out, I don't think we can enable the feature by default, which we will probably want to do in certain tunings.)

The disadvantage would be that the compiler needs to know exactly which instructions to request padding for.

But now I'm pondering if we may wish to leave space in our design to be able to avoid the other possible fallbacks out of the DSB. (E.g., avoiding having more than 3 jumps or 18 uops per 32-byte block of code.) I don't mean to suggest to implement any of those optimizations now, only that it's worth considering how we could implement that, as a potential future enhancement.

And, thinking about that, I can't really see a way to implement any of that with explicit per-instruction directives generated by the compiler. But, it seems like it would fit very well with a region-based mode. So, considering that potential future extension, I'm currently thinking that something similar to D71238, but with a directive to opt-in rather than command-line all-or-nothing, would be a great first-step.

And, we need a region-based annotation to mark which instructions can get prefix-padded, too.

So, compiler would request (let's say, with ".autopadding" and ".noautopadding") this mode for code where it's safe, and profitable to do. This would declare to the assembler that nop or prefix padding may be added as necessary before any instruction within the region to keep the instructions within whatever target microarchitectural constraints exist for the given architecture.

I imagine it not indicating any particular padding action be taken, only the opt-in for the assembler being _allowed_ to insert nops or padding.

So, my suggestions for next steps to take are:

  • Add support for such a directive into this (D71238, basic case) patch.
  • Add support for emitting it to clang. By default, I'd think it should be enabled around code clang emits, except for cold code, inline assembly, and patchable instructions emitted for PATCHPOINT, STATEPOINT, xray.
  • Incrementally extend support to use prefix-padding, and any other such future padding smarts we want.

Unfortunately, I need to disappear right now, will not be able to join the zoom call. Sorry about that!

reames abandoned this revision.Dec 16 2019, 5:43 PM

Abandoning this patch. After call, we're consolidating back on the original patch now that direction has been agreed on.