This is an archive of the discontinued LLVM Phabricator instance.

[X86] Use correct padding when in 16-bit mode
ClosedPublic

Authored by void on Feb 23 2021, 2:45 AM.

Details

Summary

In 16-bit mode, some of the nop patterns used in 32-bit mode can end up
mangling other instructions. For instance, an aligned "movz" instruction
may have the 0x66 and 0x67 prefixes omitted, because the nop that's used
messes things up.

xorl    %ebx, %ebx
.p2align 4, 0x90
movzbl  (%esi,%ebx), %ecx

Use instead nop patterns we know 16-bit mode can handle.

Diff Detail

Event Timeline

void created this revision.Feb 23 2021, 2:45 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 23 2021, 2:45 AM
Herald added a reviewer: Restricted Project. · View Herald Transcript
void requested review of this revision.Feb 23 2021, 2:45 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 23 2021, 2:45 AM

This change also modifies (breaks!) the 32-bit NOP set, definitely shouldn't do that. (But I think all that can be reverted, in any case).

llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
1075

I think we could just return 1 in 16-bit mode. There's no real reason to do nop-optimization in that mode, is there? Then the rest of the patch isn't necessary.

void added a comment.Feb 23 2021, 12:11 PM

This change also modifies (breaks!) the 32-bit NOP set, definitely shouldn't do that. (But I think all that can be reverted, in any case).

Which part did it break?

void updated this revision to Diff 325879.Feb 23 2021, 12:40 PM

Reduce size and effect of patch.

craig.topper added inline comments.Feb 23 2021, 2:35 PM
llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
1125

Isn't this xchg %eax, %eax since 0x66 would override the register to 32-bit in 16-bit mode?

1127

Should these %esi's be %si like the ones below?

void updated this revision to Diff 325906.Feb 23 2021, 2:55 PM
void marked an inline comment as done.

Correct comments with proper disassembly.

void marked an inline comment as done.Feb 23 2021, 2:55 PM

This change also modifies (breaks!) the 32-bit NOP set, definitely shouldn't do that. (But I think all that can be reverted, in any case).

Which part did it break?

It replaced the single-instruction NOPs with new sequences consisting of multiple instructions.

llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
1131

If you want to go the complex route, and support multi-byte nops in 16bit mode:

The 5 and larger-byte strings here should be removed, as they're combinations of smaller instructions. Leave only the initial 4 elements in the Nops16Bit array. Therefore, getMaximumNopSize() should return 4 in 16bit mode.

1141

This code is unnecessary.

Rename Nops to Nops32Bit, make both arrays have a second dimension of 11, and then you can do char (*Nops)[11] = STI.getFeatureBits()[X86::Mode16Bit])? Nops16Bit : Nops32Bit; -- and then the existing loop below is sufficient.

void added a comment.Feb 23 2021, 3:38 PM

This change also modifies (breaks!) the 32-bit NOP set, definitely shouldn't do that. (But I think all that can be reverted, in any case).

Which part did it break?

It replaced the single-instruction NOPs with new sequences consisting of multiple instructions.

That...shouldn't matter since a NOP is a NOP. Regardless, I removed that modification to simplify the patch.

void added a comment.Feb 23 2021, 3:45 PM

This change also modifies (breaks!) the 32-bit NOP set, definitely shouldn't do that. (But I think all that can be reverted, in any case).

Which part did it break?

It replaced the single-instruction NOPs with new sequences consisting of multiple instructions.

That...shouldn't matter since a NOP is a NOP. Regardless, I removed that modification to simplify the patch.

llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
1131

That is needlessly complex. These patterns are from binutils, which claims that they're "efficient." The code in this patch is far easier to understand, in my opinion.

FWIW: I recall there being a header in the Linux kernel for various nop sleds: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/include/asm/nops.h. I don't recall the context for why these different sequences exist. There seems to be commit messages about whether certain sequences are atomic or not. (Perhaps longer sequences are not interruptable by faults, for instance).

MaskRay added inline comments.Feb 23 2021, 5:59 PM
llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
1075

I agree that performance in 16-bit mode is not important. We can just return 1 :)

llvm/test/MC/X86/code16gcc-align.s
26

The input contains many instructions that are not checked. They should just be replaced by .nops number.

FWIW: I recall there being a header in the Linux kernel for various nop sleds: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/include/asm/nops.h. I don't recall the context for why these different sequences exist. There seems to be commit messages about whether certain sequences are atomic or not. (Perhaps longer sequences are not interruptable by faults, for instance).

The kernel uses at least 5-byte nop and 3-byte nop for code patching. Using long nops is just for performance reasons (better for the decoder). Using 1-byte nops is no worse than using long nops if performance is not a concern, as here we do for the 16-bit code.
(For code patching, atomics matter for the leading jmp/call instruction.)

That...shouldn't matter since a NOP is a NOP. Regardless, I removed that modification to simplify the patch.

It _does_ matter. To start with it can affect performance (marginally).

But more importantly, if you're doing tricky things with runtime code modification, it's a correctness issue. A trap won't leave the instruction pointer in the middle of a long-nop instruction (which you may be about to replace with another instruction), but it can leave it between two distinct nops.

llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
1075

Actually, changing my mind here. Supporting 4-byte NOPs in 16-bit mode is useful, in order that .nops 4, 4 works (dunno if anyone actually does that in practice, but it is a reasonable thing to support.)

1131

Adding an entirely distinct second method for emitting NOPs is not the answer to complexity of the existing implementation.

Using the same code with two sets of data is less complexity overall -- just need a single LOC to choose the correct table...

void added a comment.Feb 24 2021, 1:21 PM

That...shouldn't matter since a NOP is a NOP. Regardless, I removed that modification to simplify the patch.

It _does_ matter. To start with it can affect performance (marginally).

But more importantly, if you're doing tricky things with runtime code modification, it's a correctness issue. A trap won't leave the instruction pointer in the middle of a long-nop instruction (which you may be about to replace with another instruction), but it can leave it between two distinct nops.

I argue that because gas is comfortable with it it's most likely a non-issue. Especially since you can't guarantee that a sequence of nops will always be one instruction long. But like I said I reverted that bit.

void marked an inline comment as done.Feb 24 2021, 2:21 PM
void updated this revision to Diff 326205.Feb 24 2021, 2:22 PM

Rework 16-bit nops to use existing loop.

Please reduce the test case per MaskRay's comment, then lg.

llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
1076

Needs to move below the FeatureNOPL stanza.

void updated this revision to Diff 326218.Feb 24 2021, 3:22 PM

Reduce testcase.

MaskRay added inline comments.Feb 24 2021, 7:29 PM
llvm/test/MC/X86/code16gcc-align.s
8

Need a test for the 2-byte nop.

26

All the following instructions and directives can be deleted.

You can drop .LBB0_2 too.

void updated this revision to Diff 326274.Feb 24 2021, 8:38 PM

Test all nop sizes.

craig.topper added inline comments.Feb 25 2021, 12:22 PM
llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
1079

Should this just be above the NOPL check if we're going to AND NOPL with !STI.hasFeature(X86::Mode16Bit).

Should we consistently use hasFeature in this function instead of mixing with getFeatureBits()?

void added inline comments.Feb 25 2021, 3:37 PM
llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
1079

I was told to place this here and not at the start.

craig.topper added inline comments.Feb 25 2021, 3:40 PM
llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
1079

Sorry. I realize that, but that request from James doesn't make sense to me unless James thought the 16-bit NOPs required NOPL. Which they don't. @jyknight can you confirm why you asked for that change?

void updated this revision to Diff 326533.Feb 25 2021, 3:40 PM

Use "hasFeature"

jyknight accepted this revision.Feb 25 2021, 7:35 PM
jyknight added inline comments.
llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
1079

What I had intended to ask was that with !NOPL, the function should return 1 rather than 4 even in 16-bit mode.

And the reason I asked for that is if we were disabling the 2-byte nop in 32-bit mode, surely we should also be disabling it in 16-bit mode, too -- the 2-byte NOP is the same in the 16 and 32-bit nop lists. But, looking deeper, this was incorrect, because as far as I can tell there was never a reason for the code to be disabling 0x66 0x90 for !NOPL in the first place.

So, we could've/should've been returning at least 2 even with the current single table.

We could also provide a 3rd version of the nop-table for !16bit && !NOPL && !X86_64, having lea 0(%esi),%esi and lea 0(%esi,1),%esi as the 3/4-byte NOPs. I don't know if that'd be a good idea though -- I could imagine that using 3/4-byte LEA pseudo-NOPs instead of 1/2-byte actual NOPs might marginally degrade performance when people are compiling for generic-i686 and running on a modern CPU -- even though it does result in fewer instructions.

So, upshot: yep, just put this back the way you had it before. (And if you want to tackle the other issues too, that'd be fine, but also fine if not.)

This revision is now accepted and ready to land.Feb 25 2021, 7:35 PM
This revision was automatically updated to reflect the committed changes.