This is an archive of the discontinued LLVM Phabricator instance.

Differential D98906

[X86] Improve lowering of the unrolled inline-asm probing
Needs Review · Public

Authored by nagisa on Mar 18 2021, 4:55 PM.

Details

Reviewers

serge-sans-paille
efriedma
lkail

Summary

This new implementation emits instructions such as these:

movb $0, -4096(%rsp)

which is both faster and smaller than pairs of

sub $4096, %rsp
movq $0, (%rsp)

This implementation also trivially preserves the preciseness of the
uwtables during the preamble by not modifying the stack pointer in the
first place.
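To put a number on the "smaller" claim, here is a minimal sketch that compares the encoded size of one probe under each scheme. The byte sequences are hand-assembled x86-64 encodings for a 4096-byte probe distance and are an assumption of this note, not output taken from the patch; they are worth double-checking with an assembler. The names `kNewProbe`, `kOldSub`, `kOldStore` and the two helper functions are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hand-assembled encodings (assumed, not taken from the patch).

// New scheme: movb $0, -4096(%rsp)
//   C6 /0, ModRM 0x84 + SIB 0x24 (disp32 off %rsp), disp32 = -4096, imm8 = 0
constexpr std::array<std::uint8_t, 8> kNewProbe = {
    0xC6, 0x84, 0x24, 0x00, 0xF0, 0xFF, 0xFF, 0x00};

// Old scheme, part 1: subq $4096, %rsp
//   REX.W, 81 /5, imm32 = 4096
constexpr std::array<std::uint8_t, 7> kOldSub = {
    0x48, 0x81, 0xEC, 0x00, 0x10, 0x00, 0x00};

// Old scheme, part 2: movq $0, (%rsp)
//   REX.W, C7 /0, ModRM 0x04 + SIB 0x24, imm32 = 0
constexpr std::array<std::uint8_t, 8> kOldStore = {
    0x48, 0xC7, 0x04, 0x24, 0x00, 0x00, 0x00, 0x00};

// Bytes emitted per probed page under each scheme.
constexpr std::size_t bytesPerProbeNew() { return kNewProbe.size(); }
constexpr std::size_t bytesPerProbeOld() {
  return kOldSub.size() + kOldStore.size();
}

// Under these assumed encodings, each probe shrinks by 7 bytes.
static_assert(bytesPerProbeOld() - bytesPerProbeNew() == 7,
              "unexpected size difference");
```

Both schemes still end with a single stack-pointer adjustment, so the saving applies per probed page rather than to the whole prologue.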

For the generated code with a stack of 0x4000 bytes (4 probes), llvm-mca reports
(over 100 iterations):

test case	mcpu	cycles	IPC	RThroughput
old	znver2	603	1.66	2.5
new	znver2	204	2.94	1.5
old	skylake	603	1.66	4.0
new	skylake	403	1.49	4.0
old	bdver1	603	1.66	6.0
new	bdver1	403	1.49	4.0
old	haswell	603	1.66	4.0
new	haswell	403	1.49	4.0

So overall, in terms of throughput, it's either the same or
an improvement.
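As a quick cross-check of that reading, the sketch below copies the cycles and RThroughput columns of the llvm-mca table into a small array and tests whether the new sequence is ever worse. The struct and function names (`McaRow`, `neverWorseRThroughput`, `cycleRatio`) are hypothetical; only the numbers come from the table.

```cpp
#include <array>
#include <cassert>

// One row pair (old vs. new) from the llvm-mca table in the summary.
struct McaRow {
  const char *Cpu;
  double OldCycles, NewCycles;
  double OldRThroughput, NewRThroughput;
};

const std::array<McaRow, 4> kRows = {{
    {"znver2", 603, 204, 2.5, 1.5},
    {"skylake", 603, 403, 4.0, 4.0},
    {"bdver1", 603, 403, 6.0, 4.0},
    {"haswell", 603, 403, 4.0, 4.0},
}};

// True if the new sequence's reciprocal throughput is never higher
// (i.e. never worse) than the old one's on any listed CPU.
bool neverWorseRThroughput() {
  for (const McaRow &R : kRows)
    if (R.NewRThroughput > R.OldRThroughput)
      return false;
  return true;
}

// Ratio of old to new cycle counts for one CPU; > 1 means fewer cycles now.
double cycleRatio(const McaRow &R) { return R.OldCycles / R.NewCycles; }
```

With the table's figures, the cycle count drops by roughly a factor of three on znver2 and by about a factor of 1.5 on the other three CPUs.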

Depends on D98909

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

nagisa created this revision. Mar 18 2021, 4:55 PM

Herald added subscribers: pengfei, hiraditya. Mar 18 2021, 4:55 PM

update tests to not scrub away stack pointers

nagisa mentioned this in D98909: [X86, NFC] Update stack-clash tests using the automated tooling. Mar 18 2021, 5:20 PM

rebase on top of tests in D98909

nagisa edited the summary of this revision. Mar 18 2021, 5:27 PM

nagisa added a parent revision: D98909: [X86, NFC] Update stack-clash tests using the automated tooling.

Harbormaster completed remote builds in B94588: Diff 331731. Mar 18 2021, 5:27 PM

nagisa retitled this revision from [x86] Improve lowering of unrolled inline-asm probing to [X86] Improve lowering of unrolled inline-asm probing. Mar 18 2021, 5:27 PM

nagisa edited the summary of this revision.

Harbormaster completed remote builds in B94582: Diff 331723. Mar 18 2021, 5:33 PM

Harbormaster completed remote builds in B94583: Diff 331724. Mar 18 2021, 5:43 PM

restore the code that handled overaligned allocs

nagisa published this revision for review. Mar 18 2021, 6:37 PM

nagisa added reviewers: serge-sans-paille, efriedma, lkail.

nagisa added a subscriber: YangKeao.

Herald added a project: Restricted Project. Mar 18 2021, 6:37 PM

Herald added a subscriber: llvm-commits.

nagisa retitled this revision from [X86] Improve lowering of unrolled inline-asm probing to [X86] Improve lowering of the unrolled inline-asm probing. Mar 18 2021, 6:37 PM

Harbormaster completed remote builds in B94597: Diff 331741. Mar 18 2021, 7:09 PM

nagisa mentioned this in D98789: [PEI] add dwarf information for stack probe. Mar 19 2021, 2:44 AM

nagisa mentioned this in rGc2313a45307e: [X86, NFC] Update stack-clash tests using the automated tooling. Mar 19 2021, 5:02 AM

nagisa edited the summary of this revision. Mar 21 2021, 7:16 AM

Thanks for proposing this optimization. There's a reason (which is debatable) why we explicitly sub before moving:

Consider the following:

0000000000400400 <main>:
  400400:       c7 84 24 fc bf ff ff    movl   $0x1,-0x4004(%rsp)
  400407:       01 00 00 00 
  40040b:       c3                      retq

The mov accesses "unallocated" stack, which may be flagged as illegal by tools that verify memory accesses. When running valgrind on the above, I indeed get:

==1335611== Invalid write of size 4
==1335611==    at 0x400400: main (in /tmp/a.out)
==1335611==  Address 0x1ffeffb534 is not stack'd, malloc'd or (recently) free'd

Ah, I see!

Well, this isn't strictly just an optimization. This sequence came to me as I was thinking about how to make the CFI metadata correct with the least amount of work on my end. It seems that in the presence of tools such as valgrind we don't really have any other choice than to carefully emit a .cfi_def_cfa_offset for each of the subs. Which is fine I guess, but leaves me not super happy about the code we emit.

Besides the accurate uwtables, there's another conflicting use-case here – signal handlers. If we sub and then probe, we may receive a signal (and allocate a stack slot for it) in a potentially invalid stack area. Whereas if we probe and then sub, any signal handlers that occur during probing would still execute on what is significantly more likely to be a valid stack.

So, I guess, my question here is this: Is the codegen responsible for generating code that's palatable to analysis tools, or are the analysis tools responsible for comprehending the code that they inspect?

Be careful about referencing stack locations below the red-zone.

The x86-64 ABI mandates a 128-byte red zone. Your new instruction would effectively write a zero at a location below the red zone.
That region of memory is not reserved, and should be considered volatile. For example: signals and interrupt handlers are allowed to modify it.

That's the reason why, for leaf functions, compilers always emit a SUB of RSP when a negative offset (w.r.t. RSP) is too large.

In D98906#2641418, @andreadb wrote:

Be careful about referencing stack locations below the red-zone.

The x86-64 ABI mandates a 128-byte red zone. Your new instruction would effectively write a zero at a location below the red zone.
That region of memory is not reserved, and should be considered volatile. For example: signals and interrupt handlers are allowed to modify it.

Isn't the volatility unimportant for the purposes of stack probing? The write exists only to poke the page (lightly) to ensure there's no page that is not rw, probing does not particularly care if the data at the relevant addresses are overwritten, or even that the byte is written to the address in the first place (as long as it triggers page permission checks). The only concern could be that some signal handler or an interrupt put some data there, but when these are running the probing itself is suspended, isn't it?

In D98906#2641469, @nagisa wrote:

Isn't the volatility unimportant for the purposes of stack probing? The write exists only to poke the page (lightly) to ensure there's no page that is not rw, probing does not particularly care if the data at the relevant addresses are overwritten, or even that the byte is written to the address in the first place (as long as it triggers page permission checks). The only concern could be that some signal handler or an interrupt put some data there, but when these are running the probing itself is suspended, isn't it?

I agree. If the goal is just doing stack probing then it should be fine.

Mine was more of a generic "be careful about the red-zone". Just to make sure that it was considered in this design. I didn't know about your particular use case scenario though.

Besides the accurate uwtables, there's another conflicting use-case here – signal handlers. If we sub and then probe, we may receive a signal (and allocate a stack slot for it) in a potentially invalid stack area. Whereas if we probe and then sub, any signal handlers that occur during probing would still execute on what is significantly more likely to be a valid stack.

If my understanding is correct (and in the case where an alternate stack is not used), the signal handler begins with a push rbp, so it should touch the stack and trigger the PAGE_GUARD... mmmh, unless we're right at the end of the page guard after the sub, so maybe we should subq $4088?

Revision Contents

Path	Size
llvm/lib/Target/X86/X86FrameLowering.cpp	45 lines
llvm/test/CodeGen/X86/stack-clash-medium-natural-probes-mutliple-objects.ll	5 lines
llvm/test/CodeGen/X86/stack-clash-medium-natural-probes.ll	5 lines
llvm/test/CodeGen/X86/stack-clash-medium.ll	10 lines
llvm/test/CodeGen/X86/stack-clash-small-alloc-medium-align.ll	13 lines
llvm/test/CodeGen/X86/stack-clash-unknown-call.ll	5 lines

Diff 331741

llvm/lib/Target/X86/X86FrameLowering.cpp

[550 lines not shown]
 void X86FrameLowering::emitStackProbeInlineGenericBlock(
     MachineFunction &MF, MachineBasicBlock &MBB,
     MachineBasicBlock::iterator MBBI, const DebugLoc &DL, uint64_t Offset,
     uint64_t AlignOffset) const {

   const X86Subtarget &STI = MF.getSubtarget<X86Subtarget>();
   const X86TargetLowering &TLI = *STI.getTargetLowering();
   const unsigned Opc = getSUBriOpcode(Uses64BitFramePtr, Offset);
-  const unsigned MovMIOpc = Is64Bit ? X86::MOV64mi32 : X86::MOV32mi;
   const uint64_t StackProbeSize = TLI.getStackProbeSize(MF);

   uint64_t CurrentOffset = 0;

   assert(AlignOffset < StackProbeSize);

-  // If the offset is so small it fits within a page, there's nothing to do.
   if (StackProbeSize < Offset + AlignOffset) {
-    MachineInstr *MI = BuildMI(MBB, MBBI, DL, TII.get(Opc), StackPtr)
-                           .addReg(StackPtr)
-                           .addImm(StackProbeSize - AlignOffset)
-                           .setMIFlag(MachineInstr::FrameSetup);
-    MI->getOperand(3).setIsDead(); // The EFLAGS implicit def is dead.
-
-    addRegOffset(BuildMI(MBB, MBBI, DL, TII.get(MovMIOpc))
+    NumFrameExtraProbe++;
+    CurrentOffset = StackProbeSize - AlignOffset;
+    addRegOffset(BuildMI(MBB, MBBI, DL, TII.get(X86::MOV8mi))
                      .setMIFlag(MachineInstr::FrameSetup),
-                 StackPtr, false, 0)
+                 StackPtr, false, -CurrentOffset)
         .addImm(0)
         .setMIFlag(MachineInstr::FrameSetup);
-    NumFrameExtraProbe++;
-    CurrentOffset = StackProbeSize - AlignOffset;
   }

-  // For the next N - 1 pages, just probe. I tried to take advantage of
-  // natural probes but it implies much more logic and there was very few
-  // interesting natural probes to interleave.
+  // For the remaining N - 1 pages, probe.
+  //
+  // We emit the most basic `movb $0, -offset(%rsp)` instruction which is good
+  // for offsets of up-to 2GB. This is also most throughput and space efficient
+  // encoding that I (nagisa) could come up.
+  //
+  // It also naturally doesn't need any special handling for precise uwtables.
   while (CurrentOffset + StackProbeSize < Offset) {
-    MachineInstr *MI = BuildMI(MBB, MBBI, DL, TII.get(Opc), StackPtr)
-                           .addReg(StackPtr)
-                           .addImm(StackProbeSize)
-                           .setMIFlag(MachineInstr::FrameSetup);
-    MI->getOperand(3).setIsDead(); // The EFLAGS implicit def is dead.
-
-
-    addRegOffset(BuildMI(MBB, MBBI, DL, TII.get(MovMIOpc))
+    NumFrameExtraProbe++;
+    CurrentOffset += StackProbeSize;
+    addRegOffset(BuildMI(MBB, MBBI, DL, TII.get(X86::MOV8mi))
                      .setMIFlag(MachineInstr::FrameSetup),
-                 StackPtr, false, 0)
+                 StackPtr, false, -CurrentOffset)
         .addImm(0)
         .setMIFlag(MachineInstr::FrameSetup);
-    NumFrameExtraProbe++;
-    CurrentOffset += StackProbeSize;
   }

   // No need to probe the tail, it is smaller than a Page.
-  uint64_t ChunkSize = Offset - CurrentOffset;
   MachineInstr *MI = BuildMI(MBB, MBBI, DL, TII.get(Opc), StackPtr)
                          .addReg(StackPtr)
-                         .addImm(ChunkSize)
+                         .addImm(Offset)
                          .setMIFlag(MachineInstr::FrameSetup);
   MI->getOperand(3).setIsDead(); // The EFLAGS implicit def is dead.
 }

 void X86FrameLowering::emitStackProbeInlineGenericLoop(
     MachineFunction &MF, MachineBasicBlock &MBB,
     MachineBasicBlock::iterator MBBI, const DebugLoc &DL, uint64_t Offset,
     uint64_t AlignOffset) const {
[2,985 lines not shown]
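
The offset arithmetic of the new emitStackProbeInlineGenericBlock can be followed without any LLVM machinery. The sketch below is a standalone model of that arithmetic only: it returns the list of probe displacements (bytes below the stack pointer) and the single final adjustment, instead of building machine instructions. `ProbePlan` and `planUnrolledProbes` are hypothetical names, not LLVM API, and the AlignOffset values in the usage note are inferred from the updated test expectations rather than stated anywhere in the review.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model of the new unrolled probing; not part of LLVM.
struct ProbePlan {
  std::vector<std::uint64_t> ProbeOffsets; // each becomes movb $0, -off(%rsp)
  std::uint64_t FinalSub;                  // the one sub of the stack pointer
};

ProbePlan planUnrolledProbes(std::uint64_t Offset, std::uint64_t AlignOffset,
                             std::uint64_t StackProbeSize) {
  assert(AlignOffset < StackProbeSize);
  ProbePlan Plan{{}, Offset};
  std::uint64_t CurrentOffset = 0;

  // First probe: placed so that, together with the alignment slack, no more
  // than StackProbeSize bytes are skipped without being touched.
  if (StackProbeSize < Offset + AlignOffset) {
    CurrentOffset = StackProbeSize - AlignOffset;
    Plan.ProbeOffsets.push_back(CurrentOffset);
  }

  // Remaining probes: one per StackProbeSize bytes while a full page is left.
  while (CurrentOffset + StackProbeSize < Offset) {
    CurrentOffset += StackProbeSize;
    Plan.ProbeOffsets.push_back(CurrentOffset);
  }

  // The tail is smaller than a page, so it is allocated without a probe.
  return Plan;
}
```

Assuming a 4096-byte probe size, `planUnrolledProbes(5880, 0, 4096)` yields one probe at 4096 and a final sub of 5880, and `planUnrolledProbes(8192, 2048, 4096)` yields probes at 2048 and 6144 and a final sub of 8192. These agree with the updated CHECK lines for `foo` in stack-clash-medium-natural-probes-mutliple-objects.ll and `foo2` in stack-clash-small-alloc-medium-align.ll.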

llvm/test/CodeGen/X86/stack-clash-medium-natural-probes-mutliple-objects.ll

 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --no_x86_scrub_sp
 ; RUN: llc < %s | FileCheck %s

 target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
 target triple = "x86_64-unknown-linux-gnu"

 define i32 @foo() local_unnamed_addr #0 {
 ; CHECK-LABEL: foo:
 ; CHECK: # %bb.0:
-; CHECK-NEXT: subq $4096, %rsp # imm = 0x1000
-; CHECK-NEXT: movq $0, (%rsp)
-; CHECK-NEXT: subq $1784, %rsp # imm = 0x6F8
+; CHECK-NEXT: movb $0, -4096(%rsp)
+; CHECK-NEXT: subq $5880, %rsp # imm = 0x16F8
 ; CHECK-NEXT: .cfi_def_cfa_offset 5888
 ; CHECK-NEXT: movl $1, 3872(%rsp)
 ; CHECK-NEXT: movl $2, 672(%rsp)
 ; CHECK-NEXT: movl 1872(%rsp), %eax
 ; CHECK-NEXT: addq $5880, %rsp # imm = 0x16F8
 ; CHECK-NEXT: .cfi_def_cfa_offset 8
 ; CHECK-NEXT: retq
   %a = alloca i32, i64 1000, align 16
[10 lines not shown]

llvm/test/CodeGen/X86/stack-clash-medium-natural-probes.ll

 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --no_x86_scrub_sp
 ; RUN: llc < %s | FileCheck %s

 target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
 target triple = "x86_64-unknown-linux-gnu"

 define i32 @foo() local_unnamed_addr #0 {
 ; CHECK-LABEL: foo:
 ; CHECK: # %bb.0:
-; CHECK-NEXT: subq $4096, %rsp # imm = 0x1000
-; CHECK-NEXT: movq $0, (%rsp)
-; CHECK-NEXT: subq $3784, %rsp # imm = 0xEC8
+; CHECK-NEXT: movb $0, -4096(%rsp)
+; CHECK-NEXT: subq $7880, %rsp # imm = 0x1EC8
 ; CHECK-NEXT: .cfi_def_cfa_offset 7888
 ; CHECK-NEXT: movl $1, 264(%rsp)
 ; CHECK-NEXT: movl $1, 4664(%rsp)
 ; CHECK-NEXT: movl -128(%rsp), %eax
 ; CHECK-NEXT: addq $7880, %rsp # imm = 0x1EC8
 ; CHECK-NEXT: .cfi_def_cfa_offset 8
 ; CHECK-NEXT: retq
   %a = alloca i32, i64 2000, align 16
[9 lines not shown]

llvm/test/CodeGen/X86/stack-clash-medium.ll

 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --no_x86_scrub_sp
 ; RUN: llc -mtriple=x86_64-linux-android < %s | FileCheck -check-prefix=CHECK-X86-64 %s
 ; RUN: llc -mtriple=i686-linux-android < %s | FileCheck -check-prefix=CHECK-X86-32 %s

 define i32 @foo() local_unnamed_addr #0 {
 ; CHECK-X86-64-LABEL: foo:
 ; CHECK-X86-64: # %bb.0:
-; CHECK-X86-64-NEXT: subq $4096, %rsp # imm = 0x1000
-; CHECK-X86-64-NEXT: movq $0, (%rsp)
-; CHECK-X86-64-NEXT: subq $3784, %rsp # imm = 0xEC8
+; CHECK-X86-64-NEXT: movb $0, -4096(%rsp)
+; CHECK-X86-64-NEXT: subq $7880, %rsp # imm = 0x1EC8
 ; CHECK-X86-64-NEXT: .cfi_def_cfa_offset 7888
 ; CHECK-X86-64-NEXT: movl $1, 672(%rsp)
 ; CHECK-X86-64-NEXT: movl -128(%rsp), %eax
 ; CHECK-X86-64-NEXT: addq $7880, %rsp # imm = 0x1EC8
 ; CHECK-X86-64-NEXT: .cfi_def_cfa_offset 8
 ; CHECK-X86-64-NEXT: retq
 ;
 ; CHECK-X86-32-LABEL: foo:
 ; CHECK-X86-32: # %bb.0:
-; CHECK-X86-32-NEXT: subl $4096, %esp # imm = 0x1000
-; CHECK-X86-32-NEXT: movl $0, (%esp)
-; CHECK-X86-32-NEXT: subl $3916, %esp # imm = 0xF4C
+; CHECK-X86-32-NEXT: movb $0, -4096(%esp)
+; CHECK-X86-32-NEXT: subl $8012, %esp # imm = 0x1F4C
 ; CHECK-X86-32-NEXT: .cfi_def_cfa_offset 8016
 ; CHECK-X86-32-NEXT: movl $1, 800(%esp)
 ; CHECK-X86-32-NEXT: movl (%esp), %eax
 ; CHECK-X86-32-NEXT: addl $8012, %esp # imm = 0x1F4C
 ; CHECK-X86-32-NEXT: .cfi_def_cfa_offset 4
 ; CHECK-X86-32-NEXT: retl
   %a = alloca i32, i64 2000, align 16
   %b = getelementptr inbounds i32, i32* %a, i64 200
   store volatile i32 1, i32* %b
   %c = load volatile i32, i32* %a
   ret i32 %c
 }

 attributes #0 = {"probe-stack"="inline-asm"}

llvm/test/CodeGen/X86/stack-clash-small-alloc-medium-align.ll

[32 lines not shown]
 ; CHECK-LABEL: foo2:
 ; CHECK: # %bb.0:
 ; CHECK-NEXT: pushq %rbp
 ; CHECK-NEXT: .cfi_def_cfa_offset 16
 ; CHECK-NEXT: .cfi_offset %rbp, -16
 ; CHECK-NEXT: movq %rsp, %rbp
 ; CHECK-NEXT: .cfi_def_cfa_register %rbp
 ; CHECK-NEXT: andq $-2048, %rsp # imm = 0xF800
-; CHECK-NEXT: subq $2048, %rsp # imm = 0x800
-; CHECK-NEXT: movq $0, (%rsp)
-; CHECK-NEXT: subq $4096, %rsp # imm = 0x1000
-; CHECK-NEXT: movq $0, (%rsp)
-; CHECK-NEXT: subq $2048, %rsp # imm = 0x800
+; CHECK-NEXT: movb $0, -2048(%rsp)
+; CHECK-NEXT: movb $0, -6144(%rsp)
+; CHECK-NEXT: subq $8192, %rsp # imm = 0x2000
 ; CHECK-NEXT: movl $1, (%rsp,%rdi,4)
 ; CHECK-NEXT: movl (%rsp), %eax
 ; CHECK-NEXT: movq %rbp, %rsp
 ; CHECK-NEXT: popq %rbp
 ; CHECK-NEXT: .cfi_def_cfa %rsp, 8
 ; CHECK-NEXT: retq
   %a = alloca i32, i32 2000, align 2048
   %b = getelementptr inbounds i32, i32* %a, i64 %i
   store volatile i32 1, i32* %b
   %c = load volatile i32, i32* %a
   ret i32 %c
 }

 ; | case3 | alloca < probe_size, align < probe_size, alloca + align > probe_size
 define i32 @foo3(i64 %i) local_unnamed_addr #0 {
 ; CHECK-LABEL: foo3:
 ; CHECK: # %bb.0:
 ; CHECK-NEXT: pushq %rbp
 ; CHECK-NEXT: .cfi_def_cfa_offset 16
 ; CHECK-NEXT: .cfi_offset %rbp, -16
 ; CHECK-NEXT: movq %rsp, %rbp
 ; CHECK-NEXT: .cfi_def_cfa_register %rbp
 ; CHECK-NEXT: andq $-1024, %rsp # imm = 0xFC00
-; CHECK-NEXT: subq $3072, %rsp # imm = 0xC00
-; CHECK-NEXT: movq $0, (%rsp)
-; CHECK-NEXT: subq $1024, %rsp # imm = 0x400
+; CHECK-NEXT: movb $0, -3072(%rsp)
+; CHECK-NEXT: subq $4096, %rsp # imm = 0x1000
 ; CHECK-NEXT: movl $1, (%rsp,%rdi,4)
 ; CHECK-NEXT: movl (%rsp), %eax
 ; CHECK-NEXT: movq %rbp, %rsp
 ; CHECK-NEXT: popq %rbp
 ; CHECK-NEXT: .cfi_def_cfa %rsp, 8
 ; CHECK-NEXT: retq
   %a = alloca i32, i32 1000, align 1024
   %b = getelementptr inbounds i32, i32* %a, i64 %i
[51 lines not shown]

llvm/test/CodeGen/X86/stack-clash-unknown-call.ll

 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --no_x86_scrub_sp
 ; RUN: llc < %s | FileCheck %s
 target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
 target triple = "x86_64-unknown-linux-gnu"

 declare void @llvm.memset.p0i8.i64(i8* nocapture writeonly, i8, i64, i1 immarg);

 ; it's important that we don't use the call as a probe here
 define void @foo() local_unnamed_addr #0 {
 ; CHECK-LABEL: foo:
 ; CHECK: # %bb.0:
-; CHECK-NEXT: subq $4096, %rsp # imm = 0x1000
-; CHECK-NEXT: movq $0, (%rsp)
-; CHECK-NEXT: subq $3912, %rsp # imm = 0xF48
+; CHECK-NEXT: movb $0, -4096(%rsp)
+; CHECK-NEXT: subq $8008, %rsp # imm = 0x1F48
 ; CHECK-NEXT: .cfi_def_cfa_offset 8016
 ; CHECK-NEXT: movq %rsp, %rdi
 ; CHECK-NEXT: movl $8000, %edx # imm = 0x1F40
 ; CHECK-NEXT: xorl %esi, %esi
 ; CHECK-NEXT: callq memset@PLT
 ; CHECK-NEXT: addq $8008, %rsp # imm = 0x1F48
 ; CHECK-NEXT: .cfi_def_cfa_offset 8
 ; CHECK-NEXT: retq
   %a = alloca i8, i64 8000, align 16
   call void @llvm.memset.p0i8.i64(i8* align 16 %a, i8 0, i64 8000, i1 false)
   ret void
 }

 attributes #0 = {"probe-stack"="inline-asm"}
