Page MenuHomePhabricator

Align branches within 32-Byte boundary(Prefix padding)
AbandonedPublic

Authored by skan on Jan 5 2020, 4:42 AM.

Details

Summary

This patch is an enhanced version of D70157.

When -x86-align-branch-prefix-size is 0 (default 0), it use NOP padding.
When -x86-align-branch-prefix-size is 1~5, it use Prefix padding.

The prefix padding strategy:

  1. First we try to add segment prefixes to instructions before a branch.
  2. If there is no sufficient room to add segment prefixes, NOP will be inserted before a branch.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
reames added inline comments.Jan 9 2020, 1:33 PM
llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
34

Remove the reordering here.

80–85

a) segment prefixes aren't the only ones used are they?
b) the wording changes can be pulled into their own review (or simply committed)

418

Please define an enum which gives a symbolic name for these values (if there isn't already one). (i.e. what the heck are these integer constant values?)

reames added a comment.Jan 9 2020, 2:47 PM

I'm having a really hard time wrapping my head around the implementation. Can you give a high level summary of the usage of BoundaryAlign after this change? A couple of examples w/all the fragments written out might also help a lot.

llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
396

This should probably be:
if (needAlign(Inst))

return false;
517

Please remove this. It should be covered by the compiler support patch which already landed for compiled code, and he assembler syntax is separate.

526

Please assert that the total number of prefixes fits within a uint8_t. It does, but having that explicitly asserted/noted would be helpful.

MaskRay added inline comments.Jan 9 2020, 6:24 PM
llvm/include/llvm/MC/MCFragment.h
538

You can keep EmitNops above. The first 2 bytes of MCBoundaryAlignFragment and the tail of MCFragment shared the same word.

548

Drop \name Accessors. It is not useful.

561

I find this method confusing. Updating the call sites to use hasEmitNops() || hasValue()

llvm/lib/MC/MCAssembler.cpp
1005

Is it guaranteed F->getPrevNode() will not be executed on the first Fragment?

1009

AlignedOffset can be defined after AlignedSize is computed.

1016

FixedValue doesn't have to be an immediately invoked function expression.

1026–1036

if (NewSize == BF.getSize()

1027

*(xxx) -> xxx->

llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
133

AlignMaxPrefixSize = X86AlignBranchPrefixSize;

The error checking and normalization (to 5) should be done in an earlier place.

llvm/test/MC/X86/align-branch-64-2d.s
2

When writing tests, make sure llvm-mc and GNU as emit jmp of the same length. There are differences (D72197).

llvm/test/MC/X86/align-branch-64-8a.s
2

Use ## for comments.

x86_64-unknown-unknown can be simplified to x86_64 (the default is ELF).

skan updated this revision to Diff 237242.Jan 9 2020, 10:25 PM
skan marked 19 inline comments as done.
skan added inline comments.
llvm/lib/MC/MCAssembler.cpp
1009

Sorry, I didn't get your point.

llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
80–85

If there is no room to insert prefix, NOP will be emitted before the branch or fused pair.

133

Do you any suggestions about a appropriate, earlier place?

418

It is the exact value to be emitted when it has corresponding segment prefix. See X86MCCodeEmitter::emitSegmentOverridePrefix

llvm/test/MC/X86/align-branch-64-8a.s
2

I prefer to use x86_64-unknown-unknown, it seems more clear to me.

skan marked an inline comment as done.Jan 9 2020, 11:35 PM
skan added inline comments.
llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
517

Remove this will cause test align-branch-32-2a.s to fail. This check is simply and direct, I think we can keep this currently.

skan marked an inline comment as done.Jan 9 2020, 11:51 PM
skan added inline comments.
llvm/test/MC/X86/align-branch-64-2d.s
2

According to my understanding, you are referring to the reallocation of call? I will update the tests when D72197 is landed.

skan updated this revision to Diff 237250.Jan 9 2020, 11:55 PM
skan updated this revision to Diff 237482.Jan 10 2020, 9:14 PM
skan added a comment.Jan 11 2020, 12:11 AM

Summary of the usage of MCBoundaryAlignFragment:

As I commented in MCFragment.h, MCBoundaryAlignFragment is a placeholder fragment used to emit NOP or values to align a set of fragments within specific boundary. And in this application, the value is the segment prefix indeed. For example, let's say the code is

pushl  %ebp
pushl  %edi
je  .L_0

and JCC is need to be aligned.
1. Fragment Layout
The corresponding fragment sequence will be
BoundaryAlign1 | Data1 | BoundaryAlign2 | Data2 | BoundaryAlign3 | Relax1

Data1 holds pushl %ebp, Data2 holds pushl %edi, Relax1 holds je .L_0. BoundaryAlign1 and BoundaryAlign2 are used to emit prefix,
BoundaryAlign3 is used to emit NOP.

2. Determine the range of the fragments needed to be aligned
MCBoundaryAlignFragment is designed to align a set of fragments within the same section as the BoundaryAlign. BoundaryAlign1~BoundaryAlign3 are used to align fragment Relax1. Each BoundaryAlign has a member called LastFragment, which marks the end of the set of fragments, if we call the nearest backward MCBoundaryAlignFragment of LastFragment as NBBF (short for nearest backward BoundaryAlign fragment), then the set of fragments to be aligned is (NBBF, LastFragment]. In this example, the LastFragment is Relax1 and the NBBF is BoundaryAlign3.

3. Relax the MCBoundaryAlignFragment
3.1 Prerequisites
Before relaxation, we should guarantee that the BoundaryAlign fragment is in the same section as (NBBF, LastFragment], and each non-MCBoundaryAlignFragment in (BoundaryAlign, LastFragment] must have a fixed size after finite times of relaxation. For example, if the code is

pushl  %ebp
.align 16
pushl  %edi
je  .L_0

the corresponding fragment sequence will be
BoundaryAlign1 | Data1 | Align1 | BoundaryAlign2 | Data2 | BoundaryAlign3 | Relax1

Align1 is MCAlignFragment that can grow and shrink, it is not guranteed to have a fixed size after finite times of relaxation, so the BoundaryAlign1's LastFragment should be NULL.
3.2 How to relax
Let's go back to the code

pushl  %ebp
pushl  %edi
je  .L_0

BoundaryAlign1 | Data1 | BoundaryAlign2 | Data2 | BoundaryAlign3 | Relax1

LastFragment = Relax1, NBBF = BoundaryAlign3.

The fragments are always relaxed from left to right, namely in each iteration, BoundaryAlign1 is relaxed first, then Data1 and then BoundaryAlign2. When we relax a BoundaryAlign, we will assume that all the BoundaryAlign in [this, LastFragment) are of size 0. For example, let's say the size of BoundaryAlign1 and BoundaryAlign2 is limited by 1, and size of BoundaryAlign3 is not limited, and the boundary size is 32.
In the first iteration and before Relax1 is relaxed, Relax1's offset is 0x1e and size is 2, so Relax1 needs 2-byte padding.

BoundaryAlign1(0) | Data1(1) | BoundaryAlign2(0) | Data2(1) | BoundaryAlign3(0) | Relax1(2, 0x1e)

After BoundaryAlign3 is relaxed, the sequence will be

BoundaryAlign1(1) | Data1(1) | BoundaryAlign2(1) | Data2(1) | BoundaryAlign3(0) | Relax1(2, 0x20)

In the second iteration and before Relax1 is relaxed, Relax1's offset may become 0x1a and size become 6.

BoundaryAlign1(1) | Data1(1) | BoundaryAlign2(1) | Data2(1) | BoundaryAlign3(0) | Relax1(6, 0x1a)

When relax BoudaryAlign1, we will assume Relax1's offset is 0x18 (0x1a -1 -1), so Relax1 needs 0-byte padding. After BoudaryAlign3 is relaxed,
the sequence will be

BoundaryAlign1(0) | Data1(1) | BoundaryAlign2(0) | Data2(1) | BoundaryAlign3(0) | Relax1(6, 0x18)

Note
If need to align fused jcc, 2 and 3 are same as above, only 1 is slightly different. When the code is

pushl  %ebp
pushl  %edi
cmp  %eax, %ebp
je  .L_0

The corresponding fragment sequence will be
BoundaryAlign1 | Data1 | BoundaryAlign2 | Data2 | BoundaryAlign3 | Data3 | Relax1

Data1 holds pushl %ebp, Data2 holds pushl %edi, Data3 holds cmp %eax, %ebp, Relax1 holds je .L_0. BoundaryAlign1 and BoundaryAlign2 are used to emit prefix, BoundaryAlign3 is used to emit NOP. LastFragment = Relax1, NBBF = BoundaryAlign3.

skan updated this revision to Diff 237569.Jan 12 2020, 6:13 PM

Rebase after D72197 is landed.

The prefix insertion logic is scattered (alignBranchesBegin, alignBranchesEnd, and relaxBoundaryAlign), though they are tightly coupled and share a fair amount of information. I am thinking whether placing the logic in one place will remove some abstraction and make the overall logic simpler to understand.

MCBoundaryAlignFragment is currently relaxed in the loop as MCRelaxableFragment. MCBoundaryAlignFragments are pre-allocated.

If we don't require MCBoundaryAlignFragment to be processed in the same loop as other fragments. We can (1) remove pre-allocation of MCBoundaryAlignFragment (2) layout everything (except MCBoundaryAlignFragment) (3) loop over fragments and compute the candidate placement points of MCBoundaryAlignFragment. When we find a jcc which requires padding, pick the candidates on its left and place padding. The candidates can be maintained as a slide window. Whenever we encounter a MCAlignFragment, clear the sliding window. After handling a jcc which requires padding, clear the sliding window as well.

llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
568

You can remove PF->setAlignment(AlignBoundary); here.

574

And here.

602

If you place F->setAlignment(AlignBoundary) here, you can avoid 2 setAlignment calls in X86AsmBackend::alignBranchesBegin.

619

It seems unnecessary to move it from alignBranchesBegin here.

skan marked 9 inline comments as done.Jan 13 2020, 12:44 AM
skan added inline comments.
llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
568

The MCBoundaryAlignFragment may not emit anything, so the constructor of MCBoundaryAlignFragment doesn't set the AlignBoundary with value provided by the user. The data member AlignBoundary will be not corretly set here if PF->setAlignment(AlignBoundary) is removed.

574

I think we can not remove it.

602

If a MCBoundaryAlignFragment is not used to emit anything, I prefer it doesn't set the alignment. If we place F->setAlignment(AlignBoundary) here, the operation setAlignment(AlignBoundary) is unnecessary for those MCBoundaryAlignFragments that will not emit anything.

619

The purpose of moving PrevInst = Inst here is to make the early return in alignBranchesBegin simple. Otherwise, we need to write PrevInst = Inst; return in alignBranchesBegin` rather than return.

LuoYuanke added inline comments.Jan 13 2020, 1:01 AM
llvm/lib/MC/MCAssembler.cpp
616

Why we declare Value as int64_t in MCAlignFragment? It seems only 1 byte is needed. Or we add assert(BF.getValueSize() == 1)?

llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
569

Just to confirm that there is no MCBoundaryAlignFragment between the 2 macro fusion instructions. Right?

skan marked 4 inline comments as done.Jan 13 2020, 1:11 AM

The prefix insertion logic is scattered (alignBranchesBegin, alignBranchesEnd, and relaxBoundaryAlign), though they are tightly coupled and share a fair amount of information. I am thinking whether placing the logic in one place will remove some abstraction and make the overall logic simpler to understand.

MCBoundaryAlignFragment is currently relaxed in the loop as MCRelaxableFragment. MCBoundaryAlignFragments are pre-allocated.

If we don't require MCBoundaryAlignFragment to be processed in the same loop as other fragments. We can (1) remove pre-allocation of MCBoundaryAlignFragment (2) layout everything (except MCBoundaryAlignFragment) (3) loop over fragments and compute the candidate placement points of MCBoundaryAlignFragment. When we find a jcc which requires padding, pick the candidates on its left and place padding. The candidates can be maintained as a slide window. Whenever we encounter a MCAlignFragment, clear the sliding window. After handling a jcc which requires padding, clear the sliding window as well.

Although the data structure used to hold MCFragment is doubly linked list, which means we can insert a fragment anywhere, the MCObjectStreamer only has the interface insert() to insert a fragment at the end of the list. It may be hard to add an interface for MCObjectStreamer to insert fragment anywhere safely, that's the reason why I preallocate the MCBoundaryAlignFragment.

If you have interest in designing such an interface for MCObjectStreamer and placing the logic in one place, you can reimplement the Prefix Padding after the patch is landed, I will be very glad to help you review that patch.

skan marked 4 inline comments as done.Jan 13 2020, 1:17 AM
skan added inline comments.
llvm/lib/MC/MCAssembler.cpp
616

I didn't decalre Value as int64_t, I declared it as Optional<uint8_t>

llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
569

Right, there is no.

LuoYuanke added inline comments.Jan 13 2020, 2:57 AM
llvm/lib/MC/MCAssembler.cpp
616

You are right.

LuoYuanke accepted this revision.Jan 13 2020, 3:02 AM

The patch LGTM. I'd like to see the patch land before the llvm release.

This revision is now accepted and ready to land.Jan 13 2020, 3:02 AM
craig.topper added inline comments.Jan 13 2020, 11:31 AM
llvm/lib/MC/MCAssembler.cpp
1029

This 15 limit is X86 specific. Seems weird to have it mentioned in a target independent file. Can we abstract this somehow? It can probably happen after the branch.

1033

Same with this 15.

llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
415

I think this loop picks up instructions that don't use segment registers for just memory. And probably picks the wrong prefix for this instruction

mov %fs:0x1, %ss

The loop will find the %ss destination register first. But the %fs is the correct segment to use.

The patch LGTM. I'd like to see the patch land before the llvm release.

We can play it safe. NOP padding is available which should achieve most of the mitigation benefits (http://lists.llvm.org/pipermail/llvm-dev/2019-December/137610.html hw vs hw_sw_nop). This patch retrieves a bit more performance loss back at the risk of some more uncertainty. I can remove -mllvm -x86-align-branch-prefix-size= from D72463 and commit it (if it gets approved) before the 10.0.0 branch date (2020-01-15). We should definitely speed up reviewing this patch and make it available soon, so that downstream users playing with -mbranches-within-32B-boundaries.

If you have interest in designing such an interface for MCObjectStreamer and placing the logic in one place, you can reimplement the Prefix Padding after the patch is landed, I will be very glad to help you review that patch.

Thanks. I'll definitely think more in this area.

@skan @craig.topper This patch introduces an unknown bug. I applied Diff 15 (https://reviews.llvm.org/D72225?id=237569) + D72463 (clangDriver) + a local patch which enables -mbranches-within-32B-boundaries by default, then randonly picked 1000 tests and 40 failed. I will try providing a reproduce.

NOP padding alone seems good.

skan marked 2 inline comments as done.Jan 13 2020, 5:33 PM

@skan @craig.topper This patch introduces an unknown bug. I applied Diff 15 (https://reviews.llvm.org/D72225?id=237569) + D72463 (clangDriver) + a local patch which enables -mbranches-within-32B-boundaries by default, then randonly picked 1000 tests and 40 failed. I will try providing a reproduce.

NOP padding alone seems good.

Could you reproduce the fail with this patch only? Applying three patches together makes things complicated and can not prove there is something wrong. I am glad to wait for the reproduced fail for half day.

@skan @craig.topper This patch introduces an unknown bug. I applied Diff 15 (https://reviews.llvm.org/D72225?id=237569) + D72463 (clangDriver) + a local patch which enables -mbranches-within-32B-boundaries by default, then randonly picked 1000 tests and 40 failed. I will try providing a reproduce.

NOP padding alone seems good.

Could you reproduce the fail with this patch only? Applying three patches together makes things complicated and can not prove there is something wrong. I am glad to wait for the reproduced fail for half day.

  • -mllvm -x86-align-branch-boundary=32 -mllvm -x86-align-branch=fused,jcc,jmp -mllvm -x86-align-branch-prefix-size=0 => pass (NOP padding only)
  • -mllvm -x86-align-branch-boundary=32 -mllvm -x86-align-branch=fused,jcc,jmp -mllvm -x86-align-branch-prefix-size=5 => fail (this patch)

How can I reproduce with this patch only? Without a clangDriver patch, the code path added in this patch is not exercised and surely nothing breaks.

As I explained earlier, we already have NOP padding (it passes for our internal 1000 tests) With part of https://reviews.llvm.org/D72463 , we have something decent enough to ship for clang 10.0.0 (the difference between NOP padding alone and this patch is tiny). I understand that you eagerly want to ship the full feature for clang 10.0.0, but IMHO it is not very safe.

skan added a comment.Jan 13 2020, 6:33 PM

@skan @craig.topper This patch introduces an unknown bug. I applied Diff 15 (https://reviews.llvm.org/D72225?id=237569) + D72463 (clangDriver) + a local patch which enables -mbranches-within-32B-boundaries by default, then randonly picked 1000 tests and 40 failed. I will try providing a reproduce.

NOP padding alone seems good.

Could you reproduce the fail with this patch only? Applying three patches together makes things complicated and can not prove there is something wrong. I am glad to wait for the reproduced fail for half day.

  • -mllvm -x86-align-branch-boundary=32 -mllvm -x86-align-branch=fused,jcc,jmp -mllvm -x86-align-branch-prefix-size=0 => pass (NOP padding only)
  • -mllvm -x86-align-branch-boundary=32 -mllvm -x86-align-branch=fused,jcc,jmp -mllvm -x86-align-branch-prefix-size=5 => fail (this patch)

How can I reproduce with this patch only? Without a clangDriver patch, the code path added in this patch is not exercised and surely nothing breaks.

As I explained earlier, we already have NOP padding (it passes for our internal 1000 tests) With part of https://reviews.llvm.org/D72463 , we have something decent enough to ship for clang 10.0.0 (the difference between NOP padding alone and this patch is tiny). I understand that you eagerly want to ship the full feature for clang 10.0.0, but IMHO it is not very safe.

Passing option -mllvm -x86-align-branch-boundary=32 -mllvm -x86-align-branch=fused+jcc+jmp -mllvm -x86-align-branch-prefix-size=5 doesn't need any patch for driver.

skan updated this revision to Diff 237832.Jan 13 2020, 7:07 PM
skan marked an inline comment as done.
craig.topper added inline comments.Jan 13 2020, 7:19 PM
llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
545

limted->limited
peformance->performance

567

Your emitPrefix function includes the 2 byte and 3 byte VEX prefixes, the 4 byte EVEX prefix, and the 3 byte XOP prefix. The bytes after the leading byte of those prefixes can be any byte value and does not indicate a segment value.

skan updated this revision to Diff 237847.Jan 13 2020, 9:25 PM
skan marked 2 inline comments as done.
skan updated this revision to Diff 237848.Jan 13 2020, 9:34 PM

reorder the check in CanReuseDataFragment, check isBundlingEnabled before check allowAutoPadding()

skan updated this revision to Diff 237851.Jan 13 2020, 10:13 PM
MaskRay added a comment.EditedJan 14 2020, 12:15 AM

Here is one failure.

--x86-align-branch-prefix-size=0

    3444eca6:	64 48 8b 04 25 00 00 	mov    %fs:0x0,%rax
    3444ecad:	00 00 
    3444ecaf:	48 8d 80 60 f7 ff ff 	lea    -0x8a0(%rax),%rax
    3444ecb6:	83 78 04 00          	cmpl   $0x0,0x4(%rax)
    3444ecba:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)
    3444ecc0:	0f 88 08 02 00 00    	js     3444eece

--x86-align-branch-prefix-size=5

    3444fe66:	2e 2e 2e 2e 2e 64 48 	cs cs cs cs cs mov %fs:0x0,%rax
    3444fe6d:	8b 04 25 00 00 00 00 
    3444fe74:	48 8d 80 60 f7 ff ff 	lea    -0x8a0(%rax),%rax
    3444fe7b:	00 83 78 04 00 0f    	add    %al,0xf000478(%rbx)   ###### incorrect
    3444fe81:	88 08                	mov    %cl,(%rax)
    3444fe83:	02 00                	add    (%rax),%al
    3444fe85:	00 48 8d             	add    %cl,-0x73(%rax)
    3444fe88:	05 73 b4 44 01       	add    $0x144b473,%eax

I still suggest we ship something safer. If you think D72463 is acceptable, I can delete line 2022 (AlignBranchPrefixSize = 5;), and we can ship just NOP padding for clang 10.0.
https://lists.llvm.org/pipermail/llvm-dev/2019-December/137610.html NOP padding only has smaller code size increase and is good enough for the mitigation purposes.

skan added a comment.Jan 14 2020, 12:34 AM

Here is one failure.

--x86-align-branch-prefix-size=0

    3444eca6:	64 48 8b 04 25 00 00 	mov    %fs:0x0,%rax
    3444ecad:	00 00 
    3444ecaf:	48 8d 80 60 f7 ff ff 	lea    -0x8a0(%rax),%rax
    3444ecb6:	83 78 04 00          	cmpl   $0x0,0x4(%rax)
    3444ecba:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)
    3444ecc0:	0f 88 08 02 00 00    	js     3444eece

--x86-align-branch-prefix-size=5

    3444fe66:	2e 2e 2e 2e 2e 64 48 	cs cs cs cs cs mov %fs:0x0,%rax
    3444fe6d:	8b 04 25 00 00 00 00 
    3444fe74:	48 8d 80 60 f7 ff ff 	lea    -0x8a0(%rax),%rax
    3444fe7b:	00 83 78 04 00 0f    	add    %al,0xf000478(%rbx)   ###### incorrect
    3444fe81:	88 08                	mov    %cl,(%rax)
    3444fe83:	02 00                	add    (%rax),%al
    3444fe85:	00 48 8d             	add    %cl,-0x73(%rax)
    3444fe88:	05 73 b4 44 01       	add    $0x144b473,%eax

I still suggest we ship something safer. If you think D72463 is acceptable, I can delete line 2022 (AlignBranchPrefixSize = 5;), and we can ship just NOP padding for clang 10.0.
https://lists.llvm.org/pipermail/llvm-dev/2019-December/137610.html NOP padding only has smaller code size increase and is good enough for the mitigation purposes.

Could you provide the corresponding assembly code?

MaskRay added a comment.EditedJan 14 2020, 1:10 AM

You cannot prepend prefixes to callq __tls_get_addr (General-Dynamic/Local-Dynamic TLS models). The code sequence is specially organized to allow linker relaxation. Prepending prefixes may cause the linker to mis-relax the code sequence.

--x86-align-branch-prefix-size=0

    4660: 0f 84 df 01 00 00            	je	479 <_ZN12_GLOBAL__N_116do_free_no_hooksEPv+0x295>
    4666: 66 48 8d 3d 00 00 00 00      	leaq	(%rip), %rdi
		000000000000466a:  R_X86_64_TLSGD	__rseq_abi-0x4
    466e: 66 66 48 e8 00 00            	callw	0 <_ZN12_GLOBAL__N_116do_free_no_hooksEPv+0xc4>
		0000000000004672:  R_X86_64_PLT32	__tls_get_addr-0x4
    4674: 00 00                        	addb	%al, (%rax)
    4676: 83 78 04 00                  	cmpl	$0, 4(%rax)
    467a: 66 0f 1f 44 00 00            	nopw	(%rax,%rax)
    4680: 0f 88 08 02 00 00            	js	520 <_ZN12_GLOBAL__N_116do_free_no_hooksEPv+0x2de>

--x86-align-branch-prefix-size=5

    4660:	0f 84 df 01 00 00    	je     4845 <_ZN12_GLOBAL__N_116do_free_no_hooksEPv+0x295>
    4666:	2e 2e 2e 2e 2e 66 48 	cs cs cs cs data16 lea %cs:0x0(%rip),%rdi        # 4673 <_ZN12_GLOBAL__N_116do_free_no_hooksEPv+0xc3>
    466d:	8d 3d 00 00 00 00 
			466f: R_X86_64_TLSGD	__rseq_abi-0x4
    4673:	2e 66 66 48 e8 00 00 	cs data16 data16 callq 467c <_ZN12_GLOBAL__N_116do_free_no_hooksEPv+0xcc>
    467a:	00 00 
			4678: R_X86_64_PLT32	__tls_get_addr-0x4
    467c:	83 78 04 00          	cmpl   $0x0,0x4(%rax)
    4680:	0f 88 08 02 00 00    	js     488e <_ZN12_GLOBAL__N_116do_free_no_hooksEPv+0x2de>

Note, with -fno-plt, clang will emit calll *___tls_get_addr@GOT(%ebx) (32-bit) callq *__tls_get_addr@GOTPCREL(%rip) (64-bit). prefix-size= cannot alter such instructions, either.

I'll try investigating other issues in a few hours (>9).

skan added a comment.Jan 14 2020, 1:22 AM

I still suggest we ship something safer. If you think D72463 is acceptable, I can delete line 2022 (AlignBranchPrefixSize = 5;), and we can ship just NOP padding for clang 10.0.
https://lists.llvm.org/pipermail/llvm-dev/2019-December/137610.html NOP padding only has smaller code size increase and is good enough for the mitigation purposes.

If you think NOP padding is safer. I suggest we can make the general option -mbranches-within-32B-boundaries equivalent to -malign-branch=fused,jcc,jmp -malign-branch-boundary=32 -malign-branch-prefix-size=0.

skan added a comment.Jan 14 2020, 1:52 AM

You cannot prepend prefixes to callq __tls_get_addr (General-Dynamic/Local-Dynamic TLS models). The code sequence is specially organized to allow linker relaxation. Prepending prefixes may cause the linker to mis-relax the code sequence.

--x86-align-branch-prefix-size=0

    4660: 0f 84 df 01 00 00            	je	479 <_ZN12_GLOBAL__N_116do_free_no_hooksEPv+0x295>
    4666: 66 48 8d 3d 00 00 00 00      	leaq	(%rip), %rdi
		000000000000466a:  R_X86_64_TLSGD	__rseq_abi-0x4
    466e: 66 66 48 e8 00 00            	callw	0 <_ZN12_GLOBAL__N_116do_free_no_hooksEPv+0xc4>
		0000000000004672:  R_X86_64_PLT32	__tls_get_addr-0x4
    4674: 00 00                        	addb	%al, (%rax)
    4676: 83 78 04 00                  	cmpl	$0, 4(%rax)
    467a: 66 0f 1f 44 00 00            	nopw	(%rax,%rax)
    4680: 0f 88 08 02 00 00            	js	520 <_ZN12_GLOBAL__N_116do_free_no_hooksEPv+0x2de>

--x86-align-branch-prefix-size=5

    4660:	0f 84 df 01 00 00    	je     4845 <_ZN12_GLOBAL__N_116do_free_no_hooksEPv+0x295>
    4666:	2e 2e 2e 2e 2e 66 48 	cs cs cs cs data16 lea %cs:0x0(%rip),%rdi        # 4673 <_ZN12_GLOBAL__N_116do_free_no_hooksEPv+0xc3>
    466d:	8d 3d 00 00 00 00 
			466f: R_X86_64_TLSGD	__rseq_abi-0x4
    4673:	2e 66 66 48 e8 00 00 	cs data16 data16 callq 467c <_ZN12_GLOBAL__N_116do_free_no_hooksEPv+0xcc>
    467a:	00 00 
			4678: R_X86_64_PLT32	__tls_get_addr-0x4
    467c:	83 78 04 00          	cmpl   $0x0,0x4(%rax)
    4680:	0f 88 08 02 00 00    	js     488e <_ZN12_GLOBAL__N_116do_free_no_hooksEPv+0x2de>

Note, with -fno-plt, clang will emit calll *___tls_get_addr@GOT(%ebx) (32-bit) callq *__tls_get_addr@GOTPCREL(%rip) (64-bit). prefix-size= cannot alter such instructions, either.

As far as I know, TLSCALL must have a variant symbol, e.g. call ___tls_get_addr@PLT, call *___tls_get_addr@GOT(%ecx). The patch did not prepend prefixes to an instruction with variant symbol, which is guranteed by function X86AsmBackend::shouldAddPrefix(). And we can check that with test case

  .text
  .globl  foo
  .p2align  4
foo:
  .rept 5
  call    ___tls_get_addr@PLT
  .endr
  cmp     %eax, %ebp
  je      foo

Did I miss any TLSCALL?

skan updated this revision to Diff 237900.Jan 14 2020, 2:41 AM

Only add one test case align-branch-64-9a.s to prove that prefix was not prepended to instruction that has variant symbol operand. (e.g. TLSCALL)

skan updated this revision to Diff 237928.Jan 14 2020, 4:06 AM

Make sure prefixes are not prepended to any CALL, JUMP,RET

I have serious reservations about the rush to land this patch. I have expressed some of this privately, and other bits have been in previous review threads, but I want to put everything in one public place.

I don't see a strong value in having this in the current LLVM release. We have support for nop padding, and the delta between nop padding and prefix padding is minimal. My personal take is that wrapping up all of the wiring to provide the clang driver option to enable padding (via whatever mechanism is available, for now nops) should take preference over making the padding code marginally better.

I am also deeply uncomfortable that there does not appear to have been any meaningful progress on defining an assembler syntax for this feature. I understand - and in fact pushed for - the urgency in getting a mitigation out, but that urgency is now gone. We have a mitigation which closes most of the gap, we can now focus on ensuring that we're well positioned for the long term.

Returning specifically to this patch, I see the following technical concerns:

  1. The use a black list - instead of a white list - for deciding which instructions can be safely padded has already been shown to be problematic correctness wise. This seems to ignore the possibility that an instruction might *already have* prefixes when deciding how many to insert, and it's not clear to me that all chosen prefixes are valid on *all* instructions.
  2. The fact that *every single instruction* ends up in it's own fragment is a *huge* increase in memory usage. This has been brought up repeatedly (see original thread), but I have seen *no* data provided to make the overhead concrete. There are several ways to potentially mitigate this, but it doesn't appeared to have been assessed.
  3. Stylistic points - such as defining an enum with names for prefixes - previously raised in review comments have not been addressed.
  4. I see little discussion of what validation has been done on this patch. The only comments I can find - https://reviews.llvm.org/D72225#1818149 - seems to indicate that fairly basic testing exposes functional issues. That does not inspire confidence in the quality of the implementation.

Out of these, (4) and (2) are by far the most worrying.

Putting all this together, I don't think this patch should land at this time.

MaskRay added a comment.EditedJan 14 2020, 4:52 PM

Diff 21 is still incorrect. I'll give a reproduce.

(DATA16_PREFIX)
(LEA64r)
<--- Diff 21 can place cs (0x2e) here and break the General Dynamic TLS code sequence --->
<MCInst 851> (DATA16_PREFIX)
<MCInst 851> (DATA16_PREFIX)
<MCInst 2450> (REX64_PREFIX)
<MCInst 602 <MCOperand Expr:(__tls_get_addr@PLT)>> CALL64pcrel32
skan added a comment.Jan 14 2020, 8:05 PM

It seems we need more time to have further discussion. I will hold this patch until the issues mentioned by @reames @MaskRay are resolved. Welcome to post any fail tests.

Diff 21 is still incorrect. I'll give a reproduce.

(DATA16_PREFIX)
(LEA64r)
<--- Diff 21 can place cs (0x2e) here and break the General Dynamic TLS code sequence --->
<MCInst 851> (DATA16_PREFIX)
<MCInst 851> (DATA16_PREFIX)
<MCInst 2450> (REX64_PREFIX)
<MCInst 602 <MCOperand Expr:(__tls_get_addr@PLT)>> CALL64pcrel32

Pls. give us a C or assembly file to reproduce. Thx a lot.

llvm/include/llvm/MC/MCAsmBackend.h
59 ↗(On Diff #237007)

No semicolon in the end.

I have serious reservations about the rush to land this patch. I have expressed some of this privately, and other bits have been in previous review threads, but I want to put everything in one public place.

I don't see a strong value in having this in the current LLVM release. We have support for nop padding, and the delta between nop padding and prefix padding is minimal. My personal take is that wrapping up all of the wiring to provide the clang driver option to enable padding (via whatever mechanism is available, for now nops) should take preference over making the padding code marginally better.

Regarding prefix padding vs. nop padding, we observed 0.3% improvement in SPECINT geomean, and 0.5% in SPECFP geomean. It's not that much in geomean. But we saw 1~2% improvement in specific SPEC benchmarks. We also observed cases in which nop padding doesn't mitigate the effect very well, but prefix padding does. That's the reason we intend to land the prefix padding into LLVM10. So users have alternative approaches to mitigate the JCC microcode update. They can choose whatever provides better performance.
Anyway, I understand your concern and agree to hold the patch until it's pretty mature.

I am also deeply uncomfortable that there does not appear to have been any meaningful progress on defining an assembler syntax for this feature. I understand - and in fact pushed for - the urgency in getting a mitigation out, but that urgency is now gone. We have a mitigation which closes most of the gap, we can now focus on ensuring that we're well positioned for the long term.

I saw you proposed an assembler syntax in https://reviews.llvm.org/D71315. So I suppose you will continue to drive it. We can co-work on it to finalize the syntax and implement it.

skan planned changes to this revision.Jan 16 2020, 7:22 PM
data16
leaq    i@TLSGD(%rip), %rdi
data16
data16
rex64
callq   __tls_get_addr@PLT

data16 is emitted as an instruction, though it is the prefix of the leaq. Prefix should not prepend to data16, I will fix it.

skan abandoned this revision.Mar 19 2020, 8:27 PM

Reimplemented this by D76286 based on D75300