
Use 15 byte long nops on modern Intel processors
ClosedPublic

Authored by reames on Tue, Mar 10, 10:31 AM.

Details

Summary

Back in D42616, we switched our default nop length from 15 to 10 bytes because some platforms have painful decode stalls when encountering multiple instruction prefixes. (10-byte nops arise because prefixes are used to pad beyond 8 bytes, and some platforms have issues with more than two prefixes.)

Based on Agner's guides, it appears that modern Intel (SandyBridge and later) can decode an arbitrary number of prefixes without issue. Intel's guide only lists nops up to 9 bytes; I read that as providing a safe default for all their chips. Older chips and the Atom series have serious decode stalls. I can't find a conclusive reference beyond those two.
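For reference on how the longer nops are built: Intel's recommended multi-byte NOP tops out at 9 bytes (`66 0F 1F 84 00 ...`), and longer nops are formed by prepending redundant 0x66 operand-size prefixes to a long-NOP base. A quick sketch (the helper name and the choice of the 10-byte `nopw %cs:` base are mine, for illustration):

```python
# Sketch: build an n-byte NOP by prepending redundant 0x66 operand-size
# prefixes to a canonical long-NOP base. Illustrative only, not LLVM's
# actual nop table.

# 10-byte base: nopw %cs:0x0(%rax,%rax,1) -> 66 2E 0F 1F 84 00 00 00 00 00
BASE_NOP_10 = bytes([0x66, 0x2E, 0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00])

def build_long_nop(n):
    """Return an n-byte NOP (10 <= n <= 15) by padding with 0x66 prefixes."""
    assert 10 <= n <= 15, "x86 instructions are capped at 15 bytes"
    return bytes([0x66] * (n - 10)) + BASE_NOP_10

fifteen = build_long_nop(15)
assert len(fifteen) == 15
assert fifteen[:5] == b"\x66" * 5  # five redundant prefixes, then the base
```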

Diff Detail

Event Timeline

reames created this revision. Tue, Mar 10, 10:31 AM
Herald added a project: Restricted Project. Tue, Mar 10, 10:31 AM

Does anyone have any comments?

Looks good to me.

reames added a comment. Edited Thu, Mar 12, 5:36 PM

I did some quick perf testing on an ivybridge (my old laptop). As expected there's no obvious difference between the 10 byte and 15 byte variants. This is what we'd expect if there's no decoder stall encountered. The only variant notably slower is the trivial 1 byte nop repeated 15 times.

current_nop:

nopw %cs:0L(%rax,%rax,1)
nopl 0(%rax,%rax,1)
ret

proposed_nop:

.rept 5
.byte 0x66
.endr
nopw %cs:0L(%rax,%rax,1)
ret

repeat_nop:

.rept 15
nop
.endr
ret

(Edit - none of the variants actually show a notable difference after I fixed an alignment issue)
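For context, all three variants above pad exactly 15 bytes before the ret. A byte-level check (encodings hand-assembled by me; this assumes GAS keeps the disp8 form for `nopl 0(%rax,%rax,1)` rather than dropping the zero displacement, so it's worth double-checking against objdump):

```python
# Hand-assembled encodings of the three padding variants (my transcription;
# verify with a disassembler if in doubt).

current_nop = (
    bytes([0x66, 0x2E, 0x0F, 0x1F, 0x84, 0, 0, 0, 0, 0])  # nopw %cs:0x0(%rax,%rax,1), 10 bytes
    + bytes([0x0F, 0x1F, 0x44, 0, 0])                      # nopl 0x0(%rax,%rax,1), 5 bytes (disp8 form)
)
proposed_nop = (
    bytes([0x66] * 5)                                      # five extra 0x66 prefixes
    + bytes([0x66, 0x2E, 0x0F, 0x1F, 0x84, 0, 0, 0, 0, 0])  # the 10-byte nopw %cs nop
)
repeat_nop = bytes([0x90] * 15)                            # fifteen 1-byte nops

assert len(current_nop) == len(proposed_nop) == len(repeat_nop) == 15
```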

skan added subscribers: gpei, wxiao3. Thu, Mar 12, 7:03 PM
This revision was not accepted when it landed; it landed in state Needs Review.Fri, Mar 13, 11:18 AM
This revision was automatically updated to reflect the committed changes.
pcordes added a subscriber: pcordes. Edited Sun, Mar 15, 3:48 PM

Looks good to me. We could even bump up the max NOP length for "generic" CPUs without fast 15-byte NOPs.

For 32 and 64-bit code with default tune=generic, I think it makes sense to care about the 3-prefix limit (including the 0F escape byte) of Silvermont as a lowest-common-denominator baseline for a NOP strategy. I.e. a max length of 11 bytes, with two 0x66 operand-size prefixes instead of one, plus the 0F escape byte.

When it won't increase the total number of NOPs, it might make sense to aim for a max NOP length of 8 bytes for generic or Silvermont/Goldmont tuning. But if we can reach the desired alignment boundary with fewer, longer NOPs, do that even with tune=silvermont. (IFETCH / pre-decode vs. decode block boundaries are complicated, and it's hard to keep straight which CPUs do what. It's really nice that mainstream CPUs have uop caches these days, so we don't usually have to tune for that in small loops.)
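The "fewer, longer NOPs" trade-off above amounts to splitting a padding size into chunks no larger than the per-CPU max nop length. A greedy sketch (my illustration, not LLVM's actual padding emitter):

```python
def split_padding(total, max_nop_len):
    """Greedily split `total` padding bytes into NOP lengths of at most
    `max_nop_len`. A sketch of the trade-off, not LLVM's actual algorithm."""
    lengths = []
    while total > 0:
        chunk = min(total, max_nop_len)
        lengths.append(chunk)
        total -= chunk
    return lengths

# With an 11-byte cap (respecting Silvermont's 3-prefix limit), 25 bytes of
# padding takes three NOPs; with a 15-byte cap it takes only two.
assert split_padding(25, 11) == [11, 11, 3]
assert split_padding(25, 15) == [15, 10]
```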

CPUs like Sandybridge can pre-decode 16 bytes or 4 instructions per clock (or something like that), so possibly choosing the length of the first NOP to make a total of 16 bytes starting from 3 instructions ago, or from the last 2-uop instruction, could make sense if anyone wants to play with that idea.

But if we're going to care about surrounding code when emitting padding, we should just make previous instructions longer so we can use fewer or no actual NOP instructions:
https://stackoverflow.com/questions/48046814/what-methods-can-be-used-to-efficiently-extend-instruction-length-on-modern-x86


Intel Sandybridge-family (and P6-family since Core 2) have no problems with extra prefixes; even 66 ... 90 would be fine on those CPUs. The length-finding pre-decode stage handles these as efficiently as any other 15-byte instruction. And once they're cached in the uop cache, the whole NOP only uses 1 entry in a uop cache line. That all applies to Zen, too: according to Agner Fog, it has no penalty for instructions with many prefixes. (See Agner Fog's microarch PDF: https://www.agner.org/optimize/microarchitecture.pdf)

For tune=sandybridge, 66 ... 90 is actually *better* than a standard long NOP for any length from 3 to 15 bytes. But IvyBridge fixed that, so I don't think it's worth the extra code complexity to implement that micro-optimization for ~8-year-old hardware that's getting replaced. It usually won't matter for overall performance, and it's irrelevant when the code runs from the uop cache, so it's hard for a corner case to become a disproportionate part of a small loop. Agner Fog says:

The multi-byte NOP instruction with opcode 0F 1F can only be decoded at the first of the
four decoders on Sandy Bridge, while a simple NOP with extra prefixes (opcode 66 66 90)
can be decoded at any of the four decoders. The Ivy Bridge does not have this limitation.
Both types of long NOPs are decoded at a rate of four per clock on Ivy Bridge.

Bulldozer-family can handle up to 3 prefixes (on any instruction including NOP).

AMD Bobcat/Jaguar has no penalty for multiple prefixes, neither does Via Nano.

Intel Silvermont-family (including KNL) takes an extra 4 to 6 cycles decoding instructions with more than 3 prefixes (for Silvermont that even includes the 0F escape byte that's part of 0F 1F nopl, unlike most CPUs with prefix-decoding limits). Also, instructions longer than 8 bytes have a throughput of only one instruction per clock cycle, so possibly 8-byte nops are the sweet spot here for large amounts of padding? IDK. Probably you don't want to ever pad more than 16 bytes when tuning specifically for Silvermont. Goldmont raised the limit from 3 to 4, but "A few combinations of three prefixes and escape bytes also cause high delays, including the combination of F3 with 66 or REX prefixes." Silvermont-family has no uop cache, but does have a loop buffer, although that probably doesn't work for nested loops, and we wouldn't want to put padding inside an inner-most loop (even for possible branch targets).

Agner says Atom has "severe penalties" for more than 3 prefixes. (Instructions with up to 3 prefixes are decoded at 2/clock, like Silvermont.) He doesn't mention 0F escape bytes counting as prefixes for Atom, only Silvermont.

Intel Pentium-M and earlier decode slower with more than 1 prefix. (This is presumably why P6-family introduced 0F 1F mod/rm long NOPs in the first place, so the padding could be in addressing modes, not prefixes). They don't support 64-bit mode, and even for 32-bit code we probably shouldn't care *at all* about tuning for them except with tune=pentium-m or pentium2/3. P4 / P4E also have a 1 or 2 prefix limit for efficient decode when building trace cache. P4 Nocona is x86-64, but tune=generic should definitely not care about it.


Agner Fog's microarch PDF says for Core2/Nehalem:

Previous processors had a limitation on the number of instruction prefixes that the decoders
can handle per clock cycle. The Core2 and Nehalem have no such limitation. Instructions
with any number of prefixes can be decoded by any of the four decoders in one clock cycle.
The only limitation is set by the instruction set definition which limits the length of instruction
plus prefixes to 15 bytes. Thus, it is possible to decode a one-byte instruction with 14
prefixes in a single clock cycle. No instruction needs so many prefixes, of course, but
redundant prefixes can be used instead of NOP's as fillers for aligning a subsequent loop
entry. See manual 2: "Optimizing subroutines in assembly language" for a discussion of
redundant prefixes.

For Bulldozer-family he says:

Instructions with up to three prefixes can be decoded in one clock cycle. There is a very
large penalty for instructions with more than three prefixes. Instructions with 4-7 prefixes
take 14-15 clock cycles extra to decode. Instructions with 8-11 prefixes take 20-22 clock
cycles extra, and instructions with 12-14 prefixes take 27 - 28 clock cycles extra. It is
therefore not recommended to make NOP instructions longer with more than three prefixes.
The prefix count for this rule includes operand size, address size, segment, repeat, lock,
REX and XOP prefixes. A three-bytes VEX prefix counts as one, while a two-bytes VEX
prefix does not count. Escape codes (0F, 0F38, 0F3A) do not count.
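Agner's prefix-counting rule quoted above is mechanical enough to write down. A sketch (my interpretation of the rule; it ignores real-decoder details like REX having to be the last prefix, and the 8F opcode ambiguity between XOP and pop r/m):

```python
# Sketch of Agner Fog's Bulldozer-family prefix-counting rule: operand-size,
# address-size, segment, repeat, lock, REX, and XOP prefixes count; a 3-byte
# VEX counts as one, a 2-byte VEX as zero; escape bytes (0F, 0F 38, 0F 3A)
# don't count.

COUNTED = {0x66, 0x67,                      # operand-size, address-size
           0x2E, 0x36, 0x3E, 0x26, 0x64, 0x65,  # segment overrides
           0xF0, 0xF2, 0xF3}                # lock, repeat

def bulldozer_prefix_count(insn):
    count = 0
    for b in insn:
        if b in COUNTED or 0x40 <= b <= 0x4F:   # legacy prefixes and REX count
            count += 1
        elif b in (0xC4, 0x8F):                 # 3-byte VEX / XOP: counts as one
            count += 1
            break
        elif b == 0xC5:                         # 2-byte VEX: doesn't count
            break
        else:                                   # opcode or 0F escape: stop
            break
    return count

# 15-byte NOP (six 0x66 prefixes + CS override + 0F 1F ...): 7 counted
# prefixes, which would be well past Bulldozer's 3-prefix fast path.
fifteen_nop = bytes([0x66] * 6 + [0x2E, 0x0F, 0x1F, 0x84, 0, 0, 0, 0, 0])
assert bulldozer_prefix_count(fifteen_nop) == 7
```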


> (Edit - none of the variants actually show a notable difference after I fixed an alignment issue)

TL;DR: you hid the front-end decode/issue cost behind call/ret overhead.

Although with an unrolled loop doing calls, I can measure some overhead on Skylake for 15x single-byte NOP vs. the other 2 options. Like 5G vs. 4.2G cycles per 1G calls in an unrolled calling loop with times 100 call / dec/jnz. But no difference measured from calling a bare ret for one or two NOPs (long or short doesn't matter).

If you want to test front-end throughput cost, putting a small amount of code in a function with call / ret overhead is a poor choice. Agner Fog lists a throughput of 2 cycles for each of call and ret on their own (maybe somehow avoiding mispredicts for ret, unlike how https://uops.info/ tested). Or perhaps he just tested calling a ret and split up the cost between them. On Skylake, an unrolled loop doing 100x call instructions to the same ret instruction gives me ~4.2 cycles per call/ret pair. (And significant counts for dsb2mite_switches.penalty_cycles:u). I used the NASM test loop from https://stackoverflow.com/questions/49203772/does-a-series-of-x86-call-ret-instructions-form-a-dependent-chain with the memory operations in the called function removed. (I remembered I'd done some kind of call/ret test before so I went looking for that.)

(call and ret are 2 uops each, and NOP is 1 uop each, so a ~4 cycle call/ret pair gives time for 12 other uops to issue in its shadow.)
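The issue-slot arithmetic behind that parenthetical, spelled out:

```python
# Back-of-envelope issue-slot accounting for Skylake (4-wide issue/rename).
issue_width = 4       # uops issued/renamed per clock
call_ret_cycles = 4   # measured ~4.2 cycles per call/ret pair, rounded down
call_ret_uops = 4     # call and ret are 2 uops each

# Slots left over for NOP uops to issue "in the shadow" of the call/ret pair:
spare_slots = issue_width * call_ret_cycles - call_ret_uops
assert spare_slots == 12  # which is why a NOP or two is invisible in this test
```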

A better test would be in a loop that already bottlenecks on front-end throughput, e.g. *just* those sets of NOPs (repeated 15 times each so we can see the true cheapness of huge nops) inside a dec/jnz loop. That's not exactly representative; in real use you'd have one or two long NOPs following shorter instructions, not 5 or 10 of them back-to-back. On Skylake CPUs (with loop buffer disabled by a microcode update) the 15-byte NOPs will probably bottleneck on fetching from the uop cache because one uop cache way/line can only cache uops from a 32-byte aligned block of x86 machine code. So we'd expect about 2 or 3 uops per cache line instead of 6.
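That "2 or 3 uops per cache line" estimate follows from where each 15-byte NOP starts relative to 32-byte boundaries; a small sketch (my simplification: it only counts instruction start addresses per aligned block):

```python
# Sketch: how many back-to-back 15-byte NOPs *start* in each 32-byte-aligned
# block of machine code. One uop-cache way only holds uops from one such
# block, so this bounds uops per cache line for runs of long NOPs.

def starts_per_32_byte_block(insn_len, num_blocks):
    counts = [0] * num_blocks
    addr = 0
    while addr < 32 * num_blocks:
        counts[addr // 32] += 1
        addr += insn_len
    return counts

# 15-byte NOPs starting at a 32-byte boundary: 3 start in the first block,
# then 2 per block, matching the "2 or 3 uops per cache line" estimate.
assert starts_per_32_byte_block(15, 4) == [3, 2, 2, 2]
```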

On CPUs with a working LSD (pre-Skylake, and Coffee Lake and later), looping on 15-byte NOPs should run at the full 4 uops per clock issue/rename limit (including the dec/jnz loop uop; that's why I suggested unrolling by 15 so we'd have a multiple of the pipeline width). Ice Lake is 5-wide, Zen is 6 uops / 5 instructions wide.

It's obvious that single-byte NOPs are sub-optimal, but beyond the obvious effects, lots of small instructions can stop the uop cache from being able to hold all the uops, stopping nearby code from benefiting from the uop cache and causing extra DSB2MITE switch penalties on Intel CPUs. Making NOPs even longer doesn't carry any such risk for CPUs with uop caches.