This is an archive of the discontinued LLVM Phabricator instance.

[X86] Emit 11-byte or 15-byte NOPs on recent AMD targets, else default to 10-byte NOPs (PR22965)
ClosedPublic

Authored by RKSimon on Jan 27 2018, 1:42 PM.

Details

Summary

We currently emit up to 15-byte NOPs on all targets (apart from Silvermont), which stalls performance on some targets with decoders that struggle with 2 or 3 more '66' prefixes.

This patch flags recent AMD targets (btver1/znver1) to still emit 15-byte NOPs but changes the default to 10 bytes for all other targets.

I'd like to add a 11-byte NOP feature as well for AMD bdver targets, but I'm not sure where we are with feature flag usage - although I could technically identify them with the FeatureXOP instead if that is OK?

Diff Detail

Repository
rL LLVM

Event Timeline

RKSimon created this revision.Jan 27 2018, 1:42 PM
pcordes accepted this revision.Jan 27 2018, 11:50 PM

15-byte NOP should be ok for VIA Nano2000/3000 as well, if anyone still cares about them: Agner says "Instructions with any number of prefixes are decoded without delay." Nano3000 can even fuse a NOP with the preceding instruction, so it doesn't use any extra resources in the execution units.

While we're looking at NOP strategies; how hard would it be to get the built-in assembler to pad other instructions instead of just adding extra NOP padding? https://stackoverflow.com/questions/48046814/what-methods-can-be-used-to-efficiently-extend-instruction-length-on-modern-x86

e.g. use a disp32 or imm32 when you could have used an imm8 or disp8 (or no displacement). You'd want to be careful you didn't slow down decoding, and only pad within the last basic block before the alignment boundary. (i.e. don't insert padding that moves a branch or branch target forwards, except for the branch target you're trying to align). Intel's uop cache compresses constants according to their actual value, (64-bit, 32-bit, or 16-bit), not how they were encoded in the x86 instruction, so disp32 + imm32 doesn't make it worse for the uop cache than disp8 + imm8 in one instruction if the constants are the same.

Even if you don't reach the exact boundary you want, it could mean fewer NOPs, especially on targets without 15-byte NOPs.

lib/Target/X86/X86.td
880

Is this only setting the flag for btver2? Bobcat (btver1)'s decoders are also fine with repeated prefixes, and I don't see that getting set anywhere.

This revision is now accepted and ready to land.Jan 27 2018, 11:50 PM

I can add the feature flag to Bobcat no problem - although the only place I can find any confirmation that its OK is in Agner's microarch doc. We don't have the Via Nano cpu targets any more.

What about the 11-byte NOPs for bdver - worth doing even though we'll either need to hijack FeatureXOP or use up another feature bit?

Not sure on padding through prefix/imm-extension - @craig.topper @spatel any ideas? We've recently been trying harder to reduce the size of immediates.......

What about the 11-byte NOPs for bdver - worth doing even though we'll either need to hijack FeatureXOP or use up another feature bit?

Are we approaching a feature bit count limit? If not, I think it's better to make an explicit bit.

Not sure on padding through prefix/imm-extension - @craig.topper @spatel any ideas? We've recently been trying harder to reduce the size of immediates.......

What's the advantage of implicit padding via immediate widening vs. actual nops? I'm not familiar with how we do the padding currently, but it should be possible to choose the padding method in whatever pass handles that transform?

What about the 11-byte NOPs for bdver - worth doing even though we'll either need to hijack FeatureXOP or use up another feature bit?

Are we approaching a feature bit count limit? If not, I think it's better to make an explicit bit.

The feature bit limit is currently 192 and we're using 112. So we've got 80 more to go.

Not sure on padding through prefix/imm-extension - @craig.topper @spatel any ideas? We've recently been trying harder to reduce the size of immediates.......

What's the advantage of implicit padding via immediate widening vs. actual nops? I'm not familiar with how we do the padding currently, but it should be possible to choose the padding method in whatever pass handles that transform?

Actual nops take a decoder slot and emit a uop. Widening an instruction instead would avoid that.

Unfortunately, there's a not a pass that does the padding per se. Alignment is a field stored in the MachineBasicBlock until conversion to MC. At MC conversion the alignment is translated to an "alignment fragment". Instructions are immediately encoded into binary and stored in a "data fragment". The exception being relative jumps which start out with 1 byte offset and are stored in an "instruction fragment".

Once we have everything converted to fragments we go through an interative process to determine the final size and offset of every fragment. During this process we have to determine if a jump needs to be enlarged if the 1 byte offset isn't enough. If it does it has a ripple effect on every fragment after it. This can cause later jumps with negative offsets to need to be expanded as well. Once a jump has been expanded it won't be able to shrink again so eventually this process will terminate. Throughout the process the alignment fragment size is recalculated to satisfy the desired alignment. Once the iteration process finishes, the alignment fragment will be converted to NOPs and the jump instructions will be encoded. More information here https://eli.thegreenplace.net/2013/01/03/assembler-relaxation

So I don't know where immediate widening fits into that since we don't know the size of the nops we need until after most of the instructions have been encoded.

RKSimon updated this revision to Diff 131724.Jan 28 2018, 12:33 PM
RKSimon retitled this revision from [X86] Emit 15-byte NOPs on recent AMD targets, else default to 10-byte NOPs (PR22965) to [X86] Emit 11-byte or 15-byte NOPs on recent AMD targets, else default to 10-byte NOPs (PR22965).

Enabled fast 15-byte NOPs for btver1

Added fast 11-byte NOPs for bdver* targets (since we're not as short of feature bits as I thought.....)

In case anyone has any more comments, I'll keep this open for another couple of days before submission.

craig.topper added inline comments.Jan 28 2018, 11:48 PM
lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
355–356

Minor. Missing space between MaxNopLength and nops

This revision was automatically updated to reflect the committed changes.
reames added a subscriber: reames.Mar 9 2020, 3:13 PM

Replying to a very old review thread...

Looking at this commit, I can't find any information about which platform triggered the switch from 15 byte to 10 byte nops by default. From what I can tell reading through Agner's guides, it really looks like modern Intel's should also handle the 15 byte form (as they can decode an unlimited number of prefixes w/o stalls). Was this simply missed in the review discussion, or is there some bit of context here I'm missing?

p.s. In the original discussion, padding instructions for alignment was mentioned as an alternative. Amusingly enough, that's what I'm now working on which triggered this question. (see D75203 and D75300)

Herald added a project: Restricted Project. · View Herald TranscriptMar 9 2020, 3:13 PM

Looking at this commit, I can't find any information about which platform triggered the switch from 15 byte to 10 byte nops by default. From what I can tell reading through Agner's guides, it really looks like modern Intel's should also handle the 15 byte form (as they can decode an unlimited number of prefixes w/o stalls). Was this simply missed in the review discussion, or is there some bit of context here I'm missing?

IIRC the info for recent Intel targets wasn't available apart from a little info in Agner's docs and I think the Intel AOM recommended 10bytes. I'm not sure where the SLM limit came from and why its not all Atom arch but I seem to have kept it to 7bytes.

AMD's per-family SoGs usually recommend a upper limit so we used that info for each specific AMD cpu family.

As ever patches welcome for specific CPUs :-)